(k, q)-trace norm for sparse matrix factorization
Emile Richard, Department of Electrical Engineering, Stanford University
Guillaume Obozinski, Imagine, École des Ponts ParisTech
Jean-Philippe Vert, CBIO, Mines ParisTech; Institut Curie; INSERM

Abstract

We propose a new convex formulation for the sparse matrix factorization problem, in which the number of nonzero elements of the factors is fixed. An example is the problem of sparse principal component analysis, which we reformulate as a proximal operator computation through the definition of new matrix norms that can be used more generally as regularizers in various optimization problems over matrices. We estimate the Gaussian complexity of the suggested norms and of their vector analogues. Surprisingly, in the matrix case, knowing the block sizes in advance improves the statistical power of the norms by an order of magnitude compared to the usual $\ell_1$ and trace norms, whereas in the vector case the norms built using knowledge of the number of nonzeros have the same complexity as the $\ell_1$ norm.

1 Introduction

A range of supervised problems, such as link prediction in graphs containing community structure [7], phase retrieval [4] or dictionary learning [3], amount to solving sparse factorization problems, i.e., to inferring a low-rank matrix that can be factorized as the product of two sparse matrices with few columns (left factor) and few rows (right factor). Such a factorization allows more efficient storage, faster computation and more interpretable solutions, and above all leads to more accurate estimates in many situations. In the case of interaction networks, for example, this is related to the assumption that the network is organized as a collection of highly connected communities which can overlap.
More generally, considering sparse low-rank matrices combines two natural forms of sparsity, in the spectrum and in the support, which can be motivated by the need to explain the behavior of systems by a superposition of latent processes that each involve only a few parameters. Landmark applications of sparse matrix factorization are sparse principal component analysis [8, 12] or sparse canonical correlation analysis, which are widely used to analyze high-dimensional data such as genomic data.
We are motivated by the development of convex approaches to estimate sparse low-rank matrices. In order to properly define matrix factorization, given sparsity levels of the factors denoted by integers $k$ and $q$, we first introduce below the $(k, q)$-rank of a matrix as the minimum number of left and right factors, having respectively $k$ and $q$ nonzeros, required to reconstruct the matrix. This index is a more involved complexity measure for matrices than the rank, in that it conditions on the number of nonzero elements of the left and right factors of a matrix. Using this index, we propose two new atomic norms for matrices [6], obtained by restricting the atoms to the unit ball of the operator norm or of the element-wise $\ell_\infty$ norm, resulting in convex surrogates for the low $(k, q)$-rank matrix estimation problem. We provide an equivalent formulation of the norms as nuclear norms in the sense of [11], highlighting in particular a link to the k-support norm of [2]. We analyze the Gaussian complexity of the new norms and compare them to other classical penalties. Our analysis shows that while in the vector case the simple $\ell_1$ norm provides similar statistical power to the relaxation obtained by knowing in advance the number of nonzeros of the unknown, in the matrix case one can gain an order of magnitude by considering the more involved relaxations. Nonetheless, while in the vector case the computation remains feasible in polynomial time, the norms we introduce for matrices cannot be evaluated in polynomial time. We propose algorithmic schemes to approximately learn with the new norms. The same norms and meta-algorithms can be used as regularizers in supervised problems such as multitask learning, quadratic regression and phase retrieval, highlighting the fact that our algorithmic contribution does not consist in providing more efficient solutions to the rank-1 SPCA problem, but in combining atoms found by rank-1 solvers in a principled way.
We finally show experiments on real data using bilinear regression, demonstrating the flexibility of our norms in formulating various matrix-valued problems.

1.1 Notations

Throughout the text, $m_1$, $m_2$, $k$ and $q$ are integers representing respectively the size $m_1 \times m_2$ of the matrices we consider, and the number of nonzero components in the left ($k$) and right ($q$) factors when we have a low-rank matrix. Given matrices $A$ and $B$ of the same size, $\langle A, B \rangle = \mathrm{tr}(A^\top B)$ is the standard inner product of matrices. For any matrix $Z \in \mathbb{R}^{m_1 \times m_2}$, the notations $\|Z\|_0$, $\|Z\|_1$, $\|Z\|_\infty$, $\|Z\|_*$, $\|Z\|_{op}$ and $\mathrm{rank}(Z)$ stand respectively for the number of nonzeros, the entry-wise $\ell_1$ and $\ell_\infty$ norms, the trace norm (or nuclear norm, the sum of the singular values), the operator norm (the largest singular value) and the rank of $Z$. $\mathrm{supp}(Z) \subset \{1, \dots, m_1\} \times \{1, \dots, m_2\}$ is the set of indices of nonzero elements of $Z$. When dealing with a matrix $Z$ whose nonzero elements form a block, $\mathrm{supp}(Z)$ takes the form $I \times J$ where $I \subset \{1, \dots, m_1\}$ and $J \subset \{1, \dots, m_2\}$. $\mathcal{G}_k$ (resp. $\mathcal{G}_q$) is the set of subsets of $k$ indices in $\{1, \dots, m_1\}$ (resp. $q$ indices in $\{1, \dots, m_2\}$). For a matrix $Z$ and two subsets of indices $I \subset \{1, \dots, m_1\}$ and $J \subset \{1, \dots, m_2\}$, $Z_{I,J}$ is the matrix having the same entries as $Z$ inside the index subset $I \times J$, and zero entries outside.

2 Tight convex relaxations of sparse factorization constraints

In this section we propose two new matrix norms helpful to define convex formulations of various sparse matrix factorization problems. We start by defining the $(k, q)$-rank of a matrix in Section 2.1, a useful generalization of the rank which also quantifies the sparseness of a matrix factorization. We then introduce two atomic norms defined as tight convex relaxations of the $(k, q)$-rank: the $(k, q)$-trace norm (Section 2.2), obtained by relaxing the $(k, q)$-rank over the operator norm ball, and the $(k, q)$-CUT norm (also Section 2.2), obtained by relaxing the $(k, q)$-rank over the element-wise $\ell_\infty$ ball.
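As a quick concrete illustration of these notations (our own example, not from the paper), the matrix quantities above can be computed with NumPy:

```python
import numpy as np

# A small m1 x m2 matrix whose singular values are easy to read off.
Z = np.array([[3.0, 0.0, 0.0],
              [0.0, -4.0, 0.0]])

nnz = np.count_nonzero(Z)            # ||Z||_0, number of nonzeros
l1 = np.abs(Z).sum()                 # entry-wise l1 norm
linf = np.abs(Z).max()               # entry-wise l-infinity norm
sv = np.linalg.svd(Z, compute_uv=False)
trace_norm = sv.sum()                # trace norm: sum of singular values
op_norm = sv.max()                   # operator norm: largest singular value
rk = np.linalg.matrix_rank(Z)

print(nnz, l1, linf, trace_norm, op_norm, rk)  # 2 7.0 4.0 7.0 4.0 2
```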
We finally relate these matrix norms to vector norms using the concept of nuclear norms due to [11], linking in particular the $(k, q)$-trace norm for matrices to the k-support norm of [2].
2.1 The (k, q)-rank of a matrix

The rank of a matrix $Z$ is the minimum number of rank-1 matrices (i.e., outer products of vectors, of the form $ab^\top$) needed to express $Z$ as a linear combination $Z = \sum_{i=1}^r a_i b_i^\top$. It is a versatile concept, central to the definition of matrix factorization problems and low-rank approximations. The following definition generalizes this notion to incorporate conditions on the sparseness of the rank-1 elements:

Definition 1 ((k, q)-SVD and (k, q)-rank) For a matrix $Z \in \mathbb{R}^{m_1 \times m_2}$, we call a decomposition of the form $Z = \sum_{i=1}^r c_i a_i b_i^\top$, with $c_1 \ge c_2 \ge \dots \ge c_r > 0$, $a_i b_i^\top \in \mathcal{A}_{k,q}$ and minimal $r$, a $(k, q)$-sparse singular value decomposition of $Z$, or $(k, q)$-SVD. We refer to the vectors $a_i$ and $b_i$ as the left and right sparse singular vectors of $Z$, to the $c_i$ as its $(k, q)$-sparse singular values, and to $r$ as its $(k, q)$-rank.

The $(k, q)$-SVD is not necessarily unique and its factors are in general not orthogonal (see the appendix for some counter-examples). The non-orthogonality of the factors is a major difference with the usual SVD, with important algorithmic consequences. It suggests that enforcing orthogonality between the sparse factors, as done in previous work [12], would not be a good idea in our context. In particular, the example we study in the appendix does not have any sparse SVD with orthogonal factors. Note that by definition $\mathrm{rank}(Z) = (m_1, m_2)$-rank$(Z)$ and $\|Z\|_0 = (1, 1)$-rank$(Z)$, so that we have the tight inequalities
$$\mathrm{rank}(Z) \;\le\; (k, q)\text{-rank}(Z) \;\le\; \|Z\|_0.$$
The $(k, q)$-rank is useful to formalize problems such as sparse matrix factorization, which can be defined as approximating the solution of a matrix-valued problem by a matrix having low $(k, q)$-rank. For instance, the standard rank-1 SPCA problem consists in finding the matrix of $(k, k)$-rank 1 best approximating the sample covariance matrix [8]. Unfortunately, the $(k, q)$-rank is a nonconvex index, like the rank or the cardinality, leading to computational difficulties when one wants to estimate matrices with small $(k, q)$-rank.
In the sequel, we therefore define and study convex relaxations of the $(k, q)$-rank that can replace it to quantify the sparseness and low-rankness of matrices.

2.2 Convex relaxations of the (k, q)-rank

In order to define convex relaxations of the $(k, q)$-rank, it is useful to recall the definition of atomic norms as introduced by [6]:

Definition 2 (atomic norm) Given a centrally symmetric compact subset $\mathcal{A} \subset \mathbb{R}^p$ of elements called atoms, the atomic norm induced by $\mathcal{A}$ on $\mathbb{R}^p$ is the gauge function [19] of $\mathcal{A}$:
$$\|x\|_{\mathcal{A}} = \inf \{ t > 0 : x \in t \,\mathrm{conv}(\mathcal{A}) \}. \quad (1)$$
[6] show that the atomic norm is indeed a norm, which can be rewritten as
$$\|x\|_{\mathcal{A}} = \inf \Big\{ \sum_{a \in \mathcal{A}} c_a \; : \; x = \sum_{a \in \mathcal{A}} c_a a, \;\; c_a \ge 0 \;\; \forall a \in \mathcal{A} \Big\}, \quad (2)$$
and whose dual norm satisfies
$$\|x\|^*_{\mathcal{A}} = \sup \{ \langle x, z \rangle : \|z\|_{\mathcal{A}} \le 1 \} = \sup \{ \langle x, a \rangle : a \in \mathcal{A} \}. \quad (3)$$
We can now propose our first convex relaxation of the $(k, q)$-rank as an atomic norm induced by a particular set of atoms:
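For a finite atom set, the infimum in (2) is a linear program, so the gauge can be evaluated exactly in small dimensions. A minimal sketch using SciPy; the atom set and test vector below are illustrative choices of ours, not the paper's:

```python
import numpy as np
from scipy.optimize import linprog

def atomic_norm(x, atoms):
    # Gauge of conv(atoms) at x via the LP of Eq. (2):
    # minimize sum_a c_a  subject to  x = sum_a c_a * a, c_a >= 0.
    A = np.column_stack(atoms)
    res = linprog(c=np.ones(A.shape[1]), A_eq=A, b_eq=x,
                  bounds=[(0, None)] * A.shape[1])
    return res.fun

# The atoms {+-e_i} induce the l1 norm; sanity check in R^2.
e = np.eye(2)
atoms = [e[0], -e[0], e[1], -e[1]]
print(atomic_norm(np.array([3.0, -4.0]), atoms))  # 7.0
```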
Definition 3 ((k, q)-trace norm) The $(k, q)$-trace norm $\Omega_{k,q}(Z)$ is the atomic norm induced by the following set of atoms:
$$\mathcal{A}_{k,q} = \{ ab^\top : a \in \mathbb{R}^{m_1}, b \in \mathbb{R}^{m_2}, \|a\|_0 \le k, \|b\|_0 \le q, \|a\|_2 = \|b\|_2 = 1 \}. \quad (4)$$
In other words, $\mathcal{A}_{k,q}$ is the set of matrices $Z \in \mathbb{R}^{m_1 \times m_2}$ such that $(k, q)$-rank$(Z) = 1$ and $\|Z\|_{op} = 1$; the $(k, q)$-trace norm is therefore the convex relaxation of the $(k, q)$-rank over the operator norm unit ball. Besides the characterization of $\Omega_{k,q}(Z)$ as a gauge function (1), the following lemma (whose proof is postponed to the appendix) provides an alternative characterization of $\Omega_{k,q}$ and its dual:

Lemma 1 For any $Z \in \mathbb{R}^{m_1 \times m_2}$ we have
$$\Omega_{k,q}(Z) = \inf \Big\{ \sum_{(I,J) \in \mathcal{G}_k \times \mathcal{G}_q} \|Z^{(I,J)}\|_* \; : \; Z = \sum_{(I,J)} Z^{(I,J)}, \; \mathrm{supp}(Z^{(I,J)}) \subset I \times J \Big\}, \quad (5)$$
and
$$\Omega^*_{k,q}(Z) = \max \{ \|Z_{I,J}\|_{op} : I \in \mathcal{G}_k, J \in \mathcal{G}_q \}. \quad (6)$$
Our second norm is obtained by focusing on a more restricted set of atoms, motivated by applications where, in addition to being sparse and low-rank, we want to estimate matrices which are constant over blocks:

Definition 4 ((k, q)-CUT norm) The $(k, q)$-CUT norm $\|Z\|_{(k,q)\text{-CUT}}$ is the atomic norm induced by the following set of atoms:
$$\mathcal{A}^0_{k,q} = \Big\{ ab^\top \; : \; a \in \mathbb{R}^{m_1},\ \|a\|_0 = k,\ \forall i \in \mathrm{supp}(a),\ |a_i| = \tfrac{1}{\sqrt{k}}; \;\; b \in \mathbb{R}^{m_2},\ \|b\|_0 = q,\ \forall j \in \mathrm{supp}(b),\ |b_j| = \tfrac{1}{\sqrt{q}} \Big\}. \quad (7)$$
In other words, the atoms in $\mathcal{A}^0_{k,q}$ are the atoms of $\mathcal{A}_{k,q}$ whose nonzero elements all have the same amplitude; the $(k, q)$-CUT norm is therefore the convex relaxation of the $(k, q)$-rank over the (scaled) element-wise $\ell_\infty$ ball. The name was chosen by analogy with the CUT polytope of [9].

2.3 Equivalent nuclear norms built upon vector norms

Let us now consider a different approach to define convex matrix norms, by building nuclear norms for matrices [11]. These constructions can be seen as a key to writing alternate minimization algorithms.
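The dual norm (6) can be evaluated by brute force on small matrices by enumerating all supports, which makes the combinatorial nature of $\Omega^*_{k,q}$ explicit (the enumeration is exponential in general, which is one reason the paper resorts to approximate schemes). A small sketch of ours:

```python
import itertools
import numpy as np

def dual_kq_trace_norm(Z, k, q):
    # Omega*_{k,q}(Z) from Eq. (6): the largest operator norm of a k x q
    # submatrix, by exhaustive enumeration over supports I, J.
    m1, m2 = Z.shape
    best = 0.0
    for I in itertools.combinations(range(m1), k):
        for J in itertools.combinations(range(m2), q):
            top_sv = np.linalg.svd(Z[np.ix_(I, J)], compute_uv=False)[0]
            best = max(best, top_sv)
    return best

Z = np.diag([3.0, 2.0, 1.0])
print(dual_kq_trace_norm(Z, 1, 1))  # 3.0 : reduces to the largest |Z_ij|
print(dual_kq_trace_norm(Z, 3, 3))  # 3.0 : reduces to the usual operator norm
```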
For that purpose it is useful to recall the following general definition of nuclear norms and the characterization of the corresponding dual norms, taken from [11]:

Proposition 1 (nuclear norm) Let $\alpha$ and $\beta$ denote any vector norms on $\mathbb{R}^{m_1}$ and $\mathbb{R}^{m_2}$, respectively. Then
$$\nu(Z) = \inf \Big\{ \sum_i \alpha(a_i)\, \beta(b_i) \; : \; Z = \sum_i a_i b_i^\top \Big\},$$
where the infimum is taken over all summations of finite length, is a norm over $\mathbb{R}^{m_1 \times m_2}$, called the nuclear norm induced by $\alpha$ and $\beta$. Its dual is given by
$$\nu^*(Z) = \sup \{ a^\top Z b \; : \; \alpha(a) \le 1, \; \beta(b) \le 1 \}. \quad (8)$$
[Figure 1: Unit ball edges of the three vector norms of interest ($\ell_1$, the k-support norm $\theta_k$, and $\kappa_k$) in dimension $p = 3$, with the number of nonzero elements fixed to $k = 2$. The vertices of the $(2,1)$-CUT polytope constitute the set $\mathcal{A}^0_{2,1}$.]

The following result shows that the nuclear norm induced by two atomic norms (Definition 2) is itself an atomic norm. The proof is straightforward: it results from writing the dual norm thanks to (8) and observing that it is equal to the dual of the corresponding atomic norm, using expression (3).

Lemma 2 If $\alpha$ and $\beta$ are two atomic norms on $\mathbb{R}^{m_1}$ and $\mathbb{R}^{m_2}$, induced respectively by two atom sets $\mathcal{A}_1$ and $\mathcal{A}_2$, then the nuclear norm on $\mathbb{R}^{m_1 \times m_2}$ induced by $\alpha$ and $\beta$ is the atomic norm induced by the atom set
$$\mathcal{A} = \{ ab^\top : a \in \mathcal{A}_1, \; b \in \mathcal{A}_2 \}.$$
For instance, if $\alpha$ and $\beta$ are taken to be the $\ell_1$ norm for vectors, which is the atomic norm induced by the unit-norm one-sparse vectors $\{\pm e_i\}$, the resulting nuclear norm is simply the element-wise $\ell_1$ norm of the matrix, induced by the atoms $\{\pm e_i e_j^\top\}$. We now show that the $(k, q)$-trace norm and the $(k, q)$-CUT norm are both nuclear norms, induced by specific vector norms:

Theorem 1 (1) The $(k, q)$-trace norm is the nuclear norm induced by $\theta_k$ and $\theta_q$, the k- and q-support norms of [2], given by
$$\theta_k(w) = \left( \sum_{i=1}^{k-r-1} |w|_{(i)}^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{p} |w|_{(i)} \Big)^2 \right)^{1/2}, \quad (9)$$
where $|w|_{(i)}$ denotes the $i$-th largest entry of $w$ in absolute value, $r \in \{0, \dots, k-1\}$ is the unique integer such that $|w|_{(k-r-1)} > \frac{1}{r+1} \sum_{i=k-r}^{p} |w|_{(i)} \ge |w|_{(k-r)}$, and where by convention $|w|_{(0)} = \infty$.
(2) The $(k, q)$-CUT norm is the nuclear norm induced by the vector norms $\kappa_k(\cdot)$ and $\kappa_q(\cdot)$ defined by
$$\kappa_k(w) = \max \Big( \sqrt{k}\, \|w\|_\infty, \; \tfrac{1}{\sqrt{k}}\, \|w\|_1 \Big). \quad (10)$$
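Both vector norms are easy to evaluate numerically. The sketch below implements the closed form (9) by searching for the integer $r$, and $\kappa_k$ as reconstructed in (10); the helper names and test vectors are our own:

```python
import numpy as np

def theta_k(w, k):
    # k-support norm via the closed form (9): search for the unique r in
    # {0,...,k-1} with |w|_(k-r-1) > (1/(r+1)) sum_{i>=k-r} |w|_(i) >= |w|_(k-r),
    # using the convention |w|_(0) = infinity.
    s = np.sort(np.abs(w))[::-1]          # s[i-1] = |w|_(i)
    for r in range(k):
        tail = s[k - r - 1:].sum()        # sum_{i=k-r}^{p} |w|_(i)
        avg = tail / (r + 1)
        head = np.inf if k - r - 1 == 0 else s[k - r - 2]
        if head > avg >= s[k - r - 1]:
            return float(np.sqrt((s[:k - r - 1] ** 2).sum() + tail ** 2 / (r + 1)))
    raise ValueError("no valid r found")

def kappa_k(w, k):
    # kappa_k from (10): kappa_k(w) = max(sqrt(k)*||w||_inf, ||w||_1/sqrt(k)).
    return max(np.sqrt(k) * np.abs(w).max(), np.abs(w).sum() / np.sqrt(k))

# theta_1 is the l1 norm; for a vector with at most k nonzeros, theta_k is l2.
print(np.isclose(theta_k(np.array([1.5, -2.0, 0.5]), 1), 4.0))               # True
print(np.isclose(theta_k(np.array([3.0, 2.0, 1.0, 0.0]), 3), np.sqrt(14.0)))  # True
```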
2.4 Subdifferential of the (k, q)-trace norm at atoms

The following lemma provides an explicit description of the subdifferential of $\Omega_{k,q}$ at an atom $A = ab^\top \in \mathcal{A}_{k,q}$, which will be useful to study the statistical properties of this norm as a regularizer in the next section.

Lemma 3 Let $A \in \mathcal{A}_{k,q}$ with $\mathrm{supp}(A) = I_0 \times J_0$. The subdifferential of $\Omega_{k,q}$ at $A$ is
$$\partial \Omega_{k,q}(A) = \big\{ A + Z \; : \; A Z_{I_0,J_0}^\top = 0, \;\; A^\top Z_{I_0,J_0} = 0, \;\; \forall (I,J) \in \mathcal{G}_k \times \mathcal{G}_q, \; \|A_{I,J} + Z_{I,J}\|_{op} \le 1 \big\}. \quad (11)$$
The characterization of the subdifferential of a norm at a point is important to understand how easily the point can be recovered when the norm is used as a penalty. A geometric intuition for this fact can be obtained by looking at the normal cone, the conic hull of the subdifferential: the larger the normal cone at a point, the more tractable the estimation of the point in statistical terms when the norm is used as a convex penalty. Comparing the dimension of the subspace in which $Z$ lives in Lemma 3, namely $m_1 m_2 - k - q + 1$, with the analogous subspace in the case of the trace norm, $\{Z : AZ^\top = 0, A^\top Z = 0\}$, which has dimension $m_1 m_2 - m_1 - m_2 + 1$, and with that for the $\ell_1$ norm (matrices with disjoint support), which has dimension $m_1 m_2 - kq$, one expects this norm to have a much larger subdifferential, resulting in better statistical performance. In Section 3 we bring quantitative support to this observation.

2.5 Positive semidefinite matrices

The case of the decomposition of a symmetric positive semidefinite matrix $\Sigma$ deserves a specific discussion. It might indeed be desirable to obtain a decomposition of the form $\Sigma = \sum_{i=1}^r c_i a_i a_i^\top$, with the $c_i$ possibly constrained in addition to be non-negative. This is easily obtained by replacing the set of atoms by $\mathcal{A}_{k,\succeq} = \{aa^\top : a \in \mathcal{A}_k\}$, or $\mathcal{A}_{k,\mathrm{sym}} = \{\pm aa^\top : a \in \mathcal{A}_k\}$, and by constructing the corresponding atomic norms, which are very similar to the ones we presented.
It should be noted, to be accurate, that the set $\mathcal{A}_{k,\succeq}$ is not centrally symmetric and does not induce a norm but simply a gauge, which does not change the claims of this work: they rely only on the convexity of the gauge. When using the corresponding gauges as regularizers, the gauge of $\mathcal{A}_{k,\succeq}$ has the advantage of guaranteeing that the solution of the problem is p.s.d., which might not be the case for the other atomic norms.

2.6 Using $\Omega_{k,q}$ as a regularizer

We suggest using the $(k, q)$-trace norm as a regularizer in a number of applications. This means that once the matrix estimation problem is formulated as the minimization of a convex loss $\ell : \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}$, we consider solving the penalized problem of minimizing $\ell(Z) + \lambda \Omega_{k,q}(Z)$ for a parameter $\lambda > 0$. The simplest loss to consider is the squared Euclidean distance from an observation $Y$, $\ell(Z) = \frac{1}{2}\|Z - Y\|_F^2$, which can be used for denoising an observation $Y = Z^\star + G$, where $G$ stands for the noise and $Z^\star$ for the matrix we want to estimate. An important instance of such a problem is sparse PCA, where, based on the sample covariance matrix $\hat{\Sigma}_n$, one aims at solving successive rank-1 problems of finding the leading sparse eigenvector,
$$\max_z \; z^\top \hat{\Sigma}_n z \quad \text{s.t.} \quad \|z\|_2 = 1, \; \|z\|_0 \le k, \qquad \text{(rank-1 SPCA)}$$
and switching to the next component using some heuristic [12]. Rather than solving a sequence of such problems, we formulate the problem of estimating $r$ principal components having each $k$ nonzero elements as
$$\min \Big\{ \|\hat{\Sigma}_n - Z\|_F^2 \; : \; (k, k)\text{-rank}(Z) \le r \; \text{ and } \; Z \succeq 0 \Big\}, \quad (12)$$
and relax the nonconvex problem (12) to the following convex optimization problem, which can be interpreted as projecting onto the portion of the unit ball of $\Omega_k$ inside the PSD cone, or as computing the proximal map
of $\Omega_k$:
$$\min \Big\{ \tfrac{1}{2}\|\hat{\Sigma}_n - Z\|_F^2 + \lambda \Omega_k(Z) \; : \; Z \succeq 0 \Big\}.$$
Another example of a loss function comes from linear regression, where an observation $y = \mathcal{X}(Z^\star) + \epsilon$ is available, with $\mathcal{X} : \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ linear and $\epsilon$ modeling the noise process. We may use the least-squares loss $\ell(Z) = \|\mathcal{X}(Z) - y\|_2^2$ to estimate $Z^\star$. Bilinear regression is an example where the linear map $\mathcal{X}$ is given by $\mathcal{X}(Z) = \mathrm{vec}(X_1 Z X_2^\top)$, with $X_1 \in \mathbb{R}^{n \times m_1}$ and $X_2 \in \mathbb{R}^{n \times m_2}$ being design matrices. A particular instance of bilinear regression is quadratic regression, which under the PSD constraint $Z \succeq 0$ is closely related to phase retrieval [4]. In this setup, for a design matrix $X \in \mathbb{R}^{n \times p}$, we define $\mathcal{X}(Z) = \mathrm{diag}(X Z X^\top)$. We expect $Z^\star$ to be sparse, motivated by the small number of active variables in the data, and we also assume that the organization of the feature interactions into an even smaller number of latent groups makes the true parameter $Z^\star$ low-rank.

3 Statistical properties of the suggested regularizers

As discussed above, the norms we introduce are designed to be used as regularizers in various supervised and unsupervised problems. We briefly recall a set of relevant statistical guarantees that can be stated based on computing the expected dual norm of a noise process and the statistical dimension of the regularizer at particular points. In Section 3.1 we study the denoising problem: we compare the minimum mean squared errors obtained with different regularizers when denoising a single spike corrupted with additive Gaussian noise. In Section 3.2 we focus on computing the statistical dimension of the objects of interest, both in the matrix and vector cases, and compare the asymptotic rates, which sheds light on the power of the norms we study when used as convex penalties.

3.1 Minimum square error in denoising

Knowing the expected value of the dual norm of the noise process is informative for denoising. Consider the denoising problem where $Y = Z^\star + G$ is observed and we aim at estimating $Z^\star$, knowing that it has low $(k, q)$-rank.
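Since evaluating $\Omega_{k,q}$, let alone its proximal operator, is intractable, a greedy heuristic in the spirit of the meta-algorithms mentioned above can be sketched as follows: repeatedly extract an approximate $(k, q)$-sparse rank-1 atom from the residual by truncated power iteration, and add it with its coefficient shrunk by $\lambda$. This is a matching-pursuit-style sketch of ours, not the paper's exact algorithm, and all function names are hypothetical:

```python
import numpy as np

def best_atom(R, k, q, iters=50, seed=0):
    # Heuristic oracle for max <R, a b^T> over atoms: truncated power
    # iteration (the exact oracle is rank-1 SPCA and is NP-hard).
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(R.shape[1]); b /= np.linalg.norm(b)
    a = np.zeros(R.shape[0])
    for _ in range(iters):
        a = R @ b
        a[np.argsort(np.abs(a))[:-k]] = 0.0      # keep the k largest entries of a
        na = np.linalg.norm(a)
        if na == 0:
            b = rng.standard_normal(R.shape[1]); b /= np.linalg.norm(b)
            continue
        a /= na
        b = R.T @ a
        b[np.argsort(np.abs(b))[:-q]] = 0.0      # keep the q largest entries of b
        nb = np.linalg.norm(b)
        if nb > 0:
            b /= nb
    return a, b

def approx_prox(Y, lam, k, q, max_atoms=10):
    # Matching-pursuit sketch of the proximal map of lam * Omega_{k,q}:
    # add a greedy atom while the certificate <R, a b^T> exceeds lam,
    # shrinking each coefficient by lam. A heuristic, not an exact prox.
    Z = np.zeros_like(Y)
    for t in range(max_atoms):
        R = Y - Z
        a, b = best_atom(R, k, q, seed=t)
        c = a @ R @ b - lam
        if c <= 1e-12:
            break
        Z += c * np.outer(a, b)
    return Z

# Example: Y = 5 * (a sparse rank-1 atom); with lam = 1 the heuristic
# recovers the atom with its coefficient shrunk to 4.
a0 = np.zeros(4); a0[:2] = 1.0 / np.sqrt(2)
b0 = np.zeros(3); b0[0] = 1.0
Z = approx_prox(5.0 * np.outer(a0, b0), 1.0, k=2, q=1)
print(np.allclose(Z, 4.0 * np.outer(a0, b0)))  # True
```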
Following [3, Theorem 1], we know that when the parameter $\lambda$ is chosen large enough, $\lambda \ge \mathbb{E}\,\Omega^*_{k,q}(G)$, the proximal or soft-thresholding operator associated with $\Omega_{k,q}$, namely
$$\hat{Z} = \arg\min_{Z \in \mathbb{R}^{m_1 \times m_2}} \tfrac{1}{2}\|Z - Y\|_F^2 + \lambda \Omega_{k,q}(Z),$$
satisfies $\mathbb{E}\|\hat{Z} - Z^\star\|_F^2 \le \lambda\, \Omega_{k,q}(Z^\star)$. For instance, if the model is $Y = Z^\star + G$ with $G \in \mathbb{R}^{m_1 \times m_2}$ having iid Gaussian entries drawn from $\mathcal{N}(0, \sigma^2)$, we can compute the expected value of the dual norm, which governs the choice of $\lambda$.

Lemma 4 For $G \in \mathbb{R}^{m_1 \times m_2}$ with entries drawn iid from $\mathcal{N}(0, 1)$, we have
$$\mathbb{E}\,\Omega^*_{k,q}(G) \le \sqrt{k \log \frac{m_1}{k} + \frac{k}{4} + q \log \frac{m_2}{q} + \frac{q}{4}}. \quad (13)$$
More classical results are the similar quantities for the operator and $\ell_\infty$ norms: $\mathbb{E}\,\|G\|_{op} \le \sqrt{m_1} + \sqrt{m_2}$ and $\mathbb{E}\,\|G\|_\infty \le \sqrt{2 \log(m_1 m_2)}$. As a consequence, under centered Gaussian noise with variance $\sigma^2$, we obtain the bounds below.
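The estimator template can be exercised numerically with the entry-wise $\ell_1$ norm, whose proximal operator is plain soft-thresholding (the $\Omega_{k,q}$ prox itself is intractable). The spike, noise level and regularization constant below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, sigma = 40, 4, 0.1
Zstar = np.zeros((m, m)); Zstar[:k, :k] = 1.0 / k     # a k x k spike
lam = sigma * np.sqrt(2 * np.log(m * m))              # lambda ~ sigma * E||G||_inf
trials, err = 50, 0.0
for _ in range(trials):
    Y = Zstar + sigma * rng.standard_normal((m, m))
    Zhat = np.sign(Y) * np.maximum(np.abs(Y) - lam, 0.0)  # prox of lam*||.||_1
    err += np.sum((Zhat - Zstar) ** 2) / trials

# The empirical error stays within the sigma^2 * kq * log(m1 m2) regime.
print(err < sigma ** 2 * k * k * 4 * np.log(m * m))  # True
```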
Table 1: Minimum square error achieved by various norms when denoising an atom $ab^\top \in \mathcal{A}^0_{k,q}$ corrupted with unit-variance Gaussian noise. The column $k = \sqrt{m}$ gives the orders of magnitude in the regime where $m_1 = m_2 = m$ and $k = q = \sqrt{m}$.

| Matrix norm | Minimum square error | Min. S. E. for $k = \sqrt{m}$ |
|---|---|---|
| $(k,q)$-trace | $k \log \frac{m_1}{k} + q \log \frac{m_2}{q}$ | $\sqrt{m} \log m$ |
| trace | $m_1 + m_2$ | $m$ |
| $\ell_1$ | $kq \log(m_1 m_2)$ | $m \log m$ |

Theorem 2 Let $G \in \mathbb{R}^{m_1 \times m_2}$ with entries drawn iid from $\mathcal{N}(0, 1)$ and $Y = Z^\star + \sigma G$ a noisy observation. The estimator $\hat{Z} = \arg\min_Z \frac{1}{2}\|Z - Y\|_F^2 + \lambda \Omega_{k,q}(Z)$, where $\lambda = \sigma \sqrt{k \log \frac{m_1}{k} + \frac{k}{4} + q \log \frac{m_2}{q} + \frac{q}{4}}$, satisfies
$$\mathbb{E}\|\hat{Z} - Z^\star\|_F^2 \;\le\; \sigma\, \Omega_{k,q}(Z^\star) \sqrt{k \log \frac{m_1}{k} + \frac{k}{4} + q \log \frac{m_2}{q} + \frac{q}{4}}.$$
The similar estimators built using the $\ell_1$ and trace norms, namely $\tilde{Z} = \arg\min_Z \frac{1}{2}\|Z - Y\|_F^2 + \lambda_1 \|Z\|_1$ and $\bar{Z} = \arg\min_Z \frac{1}{2}\|Z - Y\|_F^2 + \lambda_* \|Z\|_*$, where $\lambda_1 = \sigma \sqrt{2 \log(m_1 m_2)}$ and $\lambda_* = \sigma(\sqrt{m_1} + \sqrt{m_2})$, satisfy
$$\mathbb{E}\|\tilde{Z} - Z^\star\|_F^2 \le \sigma \|Z^\star\|_1 \sqrt{2 \log(m_1 m_2)} \quad \text{and} \quad \mathbb{E}\|\bar{Z} - Z^\star\|_F^2 \le \sigma \|Z^\star\|_* (\sqrt{m_1} + \sqrt{m_2}).$$
A fundamental example is the so-called single-spike model, where the observation consists of an atom with entries equal in magnitude, $ab^\top \in \mathcal{A}^0_{k,q}$, corrupted with additive noise: $Y = ab^\top + \sigma G$. In this situation one may want to compare denoising with $\Omega_{k,q}$ to denoising with the trace norm or the $\ell_1$ norm. Thanks to $\|ab^\top\|_1 = \sqrt{k}\sqrt{q} = \sqrt{kq}$ and $\Omega_{k,q}(ab^\top) = \|ab^\top\|_* = 1$, we obtain the following.

Corollary 1 When $Z^\star = ab^\top \in \mathcal{A}^0_{k,q}$ is an atom, the estimators defined in Theorem 2 satisfy
$$\mathbb{E}\|\hat{Z} - ab^\top\|_F^2 \le \sigma \sqrt{k \log \frac{m_1}{k} + \frac{k}{4} + q \log \frac{m_2}{q} + \frac{q}{4}},$$
$$\mathbb{E}\|\tilde{Z} - ab^\top\|_F^2 \le \sigma \sqrt{2 kq \log(m_1 m_2)} \quad \text{and} \quad \mathbb{E}\|\bar{Z} - ab^\top\|_F^2 \le \sigma (\sqrt{m_1} + \sqrt{m_2}).$$
This is a first observation exhibiting the power of this regularizer, and the interesting orders of magnitude are put together in Table 1 to ease the comparison. Note that when $ab^\top \in \mathcal{A}_{k,q}$ gets far from the elements of $\mathcal{A}^0_{k,q}$, for instance when getting close to $e_1 e_1^\top$, the bound gets tighter for the $\ell_1$ norm and reaches $\sigma \sqrt{2 \log(m_1 m_2)}$ at $e_1 e_1^\top$, while not changing for the two other norms.

3.2 The statistical dimension of a convex function at a given point

Interesting tools have recently been used by [6, 14, 1, 10] to quantify the statistical power of a convex nonsmooth regularizer used as a constraint or penalty.
The Gaussian complexity term has a nice geometric interpretation: it quantifies the width of the cone of descent directions locally. Interestingly, this quantity is related to sample complexity terms for compressed sensing, signal denoising and demixing applications. For conciseness, we call statistical dimension of a function at a point the quantity that [1] call the statistical dimension of the tangent cone associated with the function at the given point. This quantity equals
(up to an additive constant 1) the squared Gaussian width of the tangent cone intersected with the unit Euclidean sphere, used by [6]; see [1] for technical details. To give a concise definition of what we refer to as the statistical dimension, let, for a convex function $f$, $N_f(w)$ denote the normal cone of $f$ at the point $w$, that is, the conic hull of the subdifferential. Equivalently, one can define the normal cone as the polar cone of the set of descent directions, also called the tangent cone. We denote the statistical dimension of a convex function $f : \mathbb{R}^p \to \mathbb{R}$ at $w \in \mathbb{R}^p$ by
$$S(w, f) = \mathbb{E}\, \mathrm{dist}(g, N_f(w))^2,$$
where $\mathrm{dist}$ denotes the Euclidean distance of the Gaussian vector $g \sim \mathcal{N}(0, I_p)$ from the normal cone $N_f(w)$ of $f$ at $w$. To motivate our particular focus on computing this specific complexity term, we recall a non-exhaustive list of results formulated using the statistical dimension.

Exact recovery. Having observed $y = X w^\star$ with $X \in \mathbb{R}^{n \times p}$ having iid entries drawn from $\mathcal{N}(0, 1/n)$, the solution to
$$\min_w f(w) \quad \text{s.t.} \quad Xw = y$$
equals $w^\star$ with overwhelming probability as soon as $n \ge S(w^\star, f)$; see [6, Corollary 3.3]. In addition, [1] proved (Theorem II) that the phase transition is centered at $S(w^\star, f)$, with a width not exceeding $\sqrt{p}$ in magnitude.

Robust recovery. Given a corrupted observation $y = X w^\star + \epsilon$, where the corruption $\epsilon \in \mathbb{R}^n$ is assumed to be bounded, $\|\epsilon\|_2 \le \delta$, then
$$\hat{w} = \arg\min_w f(w) \quad \text{s.t.} \quad \|Xw - y\|_2 \le \delta$$
satisfies $\|\hat{w} - w^\star\|_2 \le 2\delta/\eta$ with overwhelming probability as soon as $n \ge (S(w^\star, f) + 3/2)/(1 - \eta)^2$; see [6, Corollary 3.3].

Denoising. Assume a collection of noisy observations $x_i = w^\star + \sigma \epsilon_i$, $i = 1, \dots, n$, is available, where the $\epsilon_i \sim \mathcal{N}(0, I_p)$ are iid, and let $y = \frac{1}{n} \sum_{i=1}^n x_i$ denote their average. [5] prove in Proposition 4 that
$$\hat{w} = \arg\min_w \|w - y\|_2 \quad \text{s.t.} \quad f(w) \le f(w^\star)$$
satisfies $\mathbb{E}\|\hat{w} - w^\star\|_2^2 \le \frac{\sigma^2}{n} S(w^\star, f)$.

Demixing. Given $y = U w^\star + v^\star$, with $U \in \mathbb{R}^{p \times p}$ orthogonal and $w^\star, v^\star \in \mathbb{R}^p$, [1] proved in Theorem III that the pair $(\hat{w}, \hat{v})$ of solutions of $\min_{w,v} f(w)$ s.t.
$g(v) \le g(v^\star)$ and $y = Uw + v$ equals $(w^\star, v^\star)$ with probability $1 - \eta$, provided that $S(w^\star, f) + S(v^\star, g) \le p - 4\sqrt{p \log \frac{4}{\eta}}$. Conversely, if $S(w^\star, f) + S(v^\star, g) \ge p + 4\sqrt{p \log \frac{4}{\eta}}$, the demixing succeeds with probability at most $\eta$.

3.3 Statistical dimensions of the suggested regularizers

We will focus on computing the statistical dimension of the suggested norms at atoms. This has two principal motivations. The first is to derive a number of potential corollaries from our results for potential applications of this regularizer. The second is the comparison of the statistical dimension of the suggested norms to those of related functions. The comparison provides a better understanding of the geometry of the problem and
locates it next to better-known objects such as the trace norm or the CUT polytope. The analogy with the vector case is particularly surprising: knowledge of the number of nonzeros does not allow to gain an order of magnitude in the vector case, whereas the gain is significant in the matrix case; see Section 3.4. Let us start with a simple result showing that the $(k, q)$-trace norm is at least as statistically powerful as any convex combination of the $\ell_1$ and trace norms. By restriction to the PSD cone, this result shows the statistical superiority of the suggested norm over the more standard SDP relaxations of the sparse PCA problem [7].

Lemma 5 Let $\mathcal{A} \subset \mathcal{B}$ and let $\|\cdot\|_{\mathcal{A}}$ and $\|\cdot\|_{\mathcal{B}}$ denote the gauges associated with the convex hulls of the corresponding sets. For any $z \in \mathcal{A}$, we have
1. the inclusion of normal cones $N_{\mathcal{B}}(z) \subset N_{\mathcal{A}}(z)$;
2. for any $g \in \mathbb{R}^p$, $\mathrm{dist}(g, N_{\mathcal{A}}(z)) \le \mathrm{dist}(g, N_{\mathcal{B}}(z))$;
3. $S(z, \|\cdot\|_{\mathcal{A}}) \le S(z, \|\cdot\|_{\mathcal{B}})$.

The following result shows that $\Omega_{k,q}$ is always at least as good on $\mathcal{A}^0_{k,q}$ as the standard $\ell_1$ + trace norm penalty popular for estimating sparse low-rank matrices [18].

Proposition 2 Fix a pair of real numbers $\lambda, \nu > 0$ and let $f$ denote the function $Z \mapsto \lambda \|Z\|_1 + \nu \|Z\|_*$. The normal cone of $\Omega_{k,q}$ at $A = ab^\top \in \mathcal{A}^0_{k,q}$ contains the normal cone of $f$ and is contained in the normal cone of the $(k, q)$-CUT norm at $A$:
$$N_f(A) \subset N_{\Omega_{k,q}}(A) \subset N_{(k,q)\text{-CUT}}(A).$$
The proof can be found in the appendix. We recall the following lower bound, of order $\min\{kq,\, m_1 + m_2\}$, on the statistical dimension of the sum of the $\ell_1$ and trace norms, before moving on to upper bounds on the statistical dimensions of the $(k, q)$-trace and $(k, q)$-CUT norms, which improve on it by achieving asymptotic rates of order $k \log m_1 + q \log m_2$.

Proposition 3 ([15, Theorem 3.3]) Let $\lambda, \mu > 0$ and consider $\nu : Z \mapsto \lambda \|Z\|_1 + \mu \|Z\|_*$. Then for every $A \in \mathcal{A}_{k,q}$ there exists a constant $C > 0$ such that $S(A, \nu) \ge C \min\{kq,\, m_1 + m_2\}$, and this bound cannot be improved in terms of scaling in $k$, $q$, $m_1$, $m_2$.
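The statistical dimension is easy to estimate by Monte Carlo when the normal cone has a simple description. For the $\ell_1$ norm at a $k$-sparse point, the normal cone is $\{z : z_i = t\,\mathrm{sign}(w_i) \text{ on the support},\ |z_j| \le t \text{ off it},\ t \ge 0\}$, and the distance to it can be minimized over $t$ on a grid. A rough sketch of ours:

```python
import numpy as np

def dist2_normalcone_l1(g, w):
    # Squared distance of g to the normal cone of ||.||_1 at w, minimizing
    # over the scale t on a grid (coarse, but fine for an estimate).
    S = w != 0
    ts = np.linspace(0.0, np.abs(g).max() + 1.0, 400)
    vals = [np.sum((g[S] - t * np.sign(w[S])) ** 2)
            + np.sum(np.maximum(np.abs(g[~S]) - t, 0.0) ** 2) for t in ts]
    return min(vals)

rng = np.random.default_rng(0)
p, k = 100, 5
w = np.zeros(p); w[:k] = 1.0
est = np.mean([dist2_normalcone_l1(rng.standard_normal(p), w)
               for _ in range(200)])

# The estimate sits in the k log(p/k) regime of Table 2.
print(est < 2 * k * np.log(p / k) + 5 * k)  # True
```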
Before turning to more advanced results, let us introduce the following coefficient, which quantifies how close a point $ab^\top \in \mathcal{A}_{k,q}$ is to the subset $\mathcal{A}^0_{k,q}$.

Definition 5 (Dispersion coefficient) Let $A = ab^\top \in \mathcal{A}_{k,q}$ and let $I_0 = \mathrm{supp}(a)$ and $J_0 = \mathrm{supp}(b)$. We define the dispersion coefficient $\gamma(a, b) \in (0, \frac{1}{4}]$ as
$$\gamma(a, b) = \frac{1}{4} \min \Big( k \min_{i \in I_0} a_i^2, \;\; q \min_{j \in J_0} b_j^2 \Big),$$
the maximal value $\frac{1}{4}$ being reached for vectors $a$ and $b$ having nonzero entries of the same amplitude, $\frac{1}{\sqrt{k}}$ and $\frac{1}{\sqrt{q}}$ respectively.
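The coefficient is straightforward to compute; a small check (with example vectors of our own) confirms that equal-amplitude atoms attain the maximal value $1/4$:

```python
import numpy as np

def dispersion(a, b):
    # Dispersion coefficient of Definition 5:
    # gamma(a,b) = (1/4) * min(k * min_{i in I0} a_i^2, q * min_{j in J0} b_j^2).
    a_nz, b_nz = a[a != 0], b[b != 0]
    k, q = len(a_nz), len(b_nz)
    return 0.25 * min(k * (a_nz ** 2).min(), q * (b_nz ** 2).min())

# An atom of A^0_{k,q}: equal-amplitude entries, gamma = 1/4.
a = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
b = np.array([1.0, 0.0])
print(round(dispersion(a, b), 12))  # 0.25
```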
Proposition 4 Let $A = ab^\top \in \mathcal{A}_{k,q}$, and let $\gamma = \gamma(a, b)$ denote its dispersion coefficient. The statistical dimension of $\Omega_{k,q}$ at $A$ is bounded by
$$S(A, \Omega_{k,q}) \;\le\; \frac{16(k+q)}{\gamma} + \frac{1}{\gamma}\Big( k \log \frac{m_1}{k} + q \log \frac{m_2}{q} \Big).$$

Corollary 2 In case $A \in \mathcal{A}^0_{k,q}$, we get $\gamma = \frac{1}{4}$, so the statistical dimension is bounded by a factor of the order of $k \log \frac{m_1}{k} + q \log \frac{m_2}{q}$, which fits the rates obtained for the $(k, q)$-CUT norm (see Proposition 6) up to additive terms $k \log k + q \log q \le k \log m_1 + q \log m_2$.

In the vector case, thanks to the closed-form expression (9) of the k-support norm, we can obtain the following bound, which has the same asymptotic rate with smaller constants than the upper bound obtained by simply setting $m_2 = q = 1$ in Proposition 4.

Proposition 5 The statistical dimension of the k-support norm $\theta_k$ at an s-sparse vector $w \in \mathbb{R}^p$ is bounded by
$$S(w, \theta_k) \;\le\; \min\left\{ \frac{5}{4}\, k + \left( \frac{(r+1)\, \theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}} \right)^{2} \log \frac{p}{k}, \;\; \frac{5}{3} \left( \frac{(r+1)\, \theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}} \right)^{2/3} (p - k)^{1/3} \right\},$$
where $|w|_{(i)}$ denotes the $i$-th entry of $w$ sorted in descending order of absolute values. The term $\frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}}$ is the vector analogue of the dispersion coefficient $\gamma(a, b)$ (see Definition 5). We distinguish 4 regimes:

1. In case $w \in \mathcal{A}_{k,1} \setminus \mathcal{A}^0_{k,1}$ is k-sparse with entries not equal in absolute value, $r = 0$ and $\theta_k(w) = \|w\|_2 = 1$. Therefore $\frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}} = \frac{1}{\min_{i \in I_0} |w_i|}$, and the bound becomes
$$S(w, \theta_k) \le \min\left\{ \frac{5}{4}\, k + \frac{1}{\min_{i \in I_0} w_i^2} \log \frac{p}{k}, \;\; \frac{5}{3}\, \frac{(p-k)^{1/3}}{(\min_{i \in I_0} |w_i|)^{2/3}} \right\}.$$

2. In the most favorable case, when all the $s = k$ nonzero elements take the same absolute value, $w \in \mathcal{A}^0_{k,1}$, then $r = k - 1$ takes its maximal value. In this case the norm is proportional to the $\ell_1$ norm and we bound the statistical dimension by $\frac{5}{4} k + k \log(p/k)$, which is exactly the statistical dimension bound of the $\ell_1$ norm at a k-sparse point. Note that the bound is continuous at the atoms: for $w \in \mathcal{A}_{k,1}$, the dispersion-like term $\frac{1}{\min_i w_i^2}$ tends to $k$ as $w$ gets close to $\mathcal{A}^0_{k,1}$, and this fits $\frac{5}{4} k + k \log(p/k)$.
Substituting $m_2 = q = 1$ in Proposition 6 on the $(k, q)$-CUT polytope, we get a term of the same order of magnitude with a larger constant, which we believe is due to the proof techniques used in [6, Theorem 3.9 and Corollary 3.4]. The order of magnitude, however, cannot be improved, and this ensures the tightness of the bound up to a multiplicative factor. In Figure 1 one can distinguish the points having coordinates with the same absolute value at the intersection of the three unit ball edges, represented in green, red and blue. The tightness between the tangent cones of the $\ell_1$ and k-support norms on $\mathcal{A}^0_{k,1}$ can be clearly understood in this figure.

3. An intermediary regime exists where $\sum_{i=k-r}^{s} |w|_{(i)} / \theta_k(w)$ is of the order of $\sqrt{p}$. In this case we get an asymptotic behavior in $O\Big( \big( \frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}} \big)^{2/3} (p-k)^{1/3} \Big)$.

4. The other extreme case, which is the least favorable, occurs when $r = 0$ and $\theta_k(w) / \sum_{i=k}^{s} |w|_{(i)}$ is much larger than $\sqrt{p}$; the k-support norm is then proportional to the $\ell_2$ norm in a neighborhood of the point. This situation happens at points close to canonical elements, $e_1$ for instance, where the norm is differentiable. In these cases the Gaussian width is trivially bounded by $p$.
Table 2: Order of magnitude of the statistical dimension of various norms at atoms located in the subset $\mathcal{A}^0_{k,q}$. The $\ell_1$ norm for matrices is the element-wise $\ell_1$ norm, and the trace norm in the vector case ($p = m_1$, $m_2 = 1$) reduces to the $\ell_2$ norm. The column $k = \sqrt{m}$ gives the orders of magnitude in the case of the planted clique problem, where $m_1 = m_2 = m$ and the clique size is set to $k = q = \sqrt{m}$.

| Matrix norm | Statistical dimension | $k = \sqrt{m}$ | Vector norm | Statistical dimension |
|---|---|---|---|---|
| $(k,q)$-trace | $k \log \frac{m_1}{k} + q \log \frac{m_2}{q}$ | $\sqrt{m} \log m$ | k-support | $k \log \frac{p}{k}$ |
| $(k,q)$-CUT | $k \log \frac{m_1}{k} + q \log \frac{m_2}{q}$ | $\sqrt{m} \log m$ | $\kappa_k$ | $k \log \frac{p}{k}$ |
| $\ell_1$ | $kq \log \frac{m_1 m_2}{kq}$ | $m \log m$ | $\ell_1$ | $k \log \frac{p}{k}$ |
| trace | $m_1 + m_2$ | $m$ | $\ell_2$ | $p$ |
| $\ell_1$ + trace | $\min\{kq,\, m_1 + m_2\}$ | $m$ | elastic net | $k \log \frac{p}{k}$ |
| CUT | $m_1 + m_2$ | $m$ | $\ell_\infty$ | $p$ |

We now study the polyhedral functions defined using the gauge of $\mathcal{A}^0_{k,q}$, which we named the $(k, q)$-CUT norm.

Proposition 6 The statistical dimension of the $(k, q)$-CUT norm at elements of $\mathcal{A}^0_{k,q}$ is bounded by
$$9 \Big( k \log \frac{m_1}{k} + q \log \frac{m_2}{q} + (k+q)(1 + \log 2) \Big).$$
An elementary corollary of Proposition 6 provides the statistical dimension of $\kappa_k$ by setting $q = 1$: for $a \in \mathcal{A}^0_{k,1}$,
$$S(a, \kappa_k) \le 9 k \log \frac{p}{k} + 9(1 + \log 2)\, k.$$
In order to compare this polytope with the CUT polytope (see e.g. [9]), we note that any matrix $A = ab^\top \in \mathcal{A}^0_{k,q}$ lies on a face of the polytope $\frac{1}{\sqrt{kq}}\,\mathrm{CUT}$, where $\mathrm{CUT} = \mathrm{conv}\{ab^\top : a \in \{\pm 1\}^{m_1}, b \in \{\pm 1\}^{m_2}\}$. We can state the following result on the statistical dimension of this norm, obtained by using the same argument as above and by noticing that CUT has $2^{m_1 + m_2 - 1}$ vertices.

Proposition 7 The statistical dimension of the CUT norm at points of $\mathcal{A}^0_{k,q}$ is bounded by $O(m_1 + m_2)$.

3.4 Comparison of the norms' properties

The study of the vector case completes the comparison with a surprising message: the construction of the $(k, q)$-trace norm leads to a significant improvement over its simple rivals (the $\ell_1$ and trace norms), but the k-support norm does not lead to a significant benefit in the vector case. The same holds for the polyhedral norms defined through the $(k, q)$-CUT and $\kappa_k$ norms.
In fact, the bound obtained in Proposition 5 fits the $\ell_1$-norm order of magnitude $k \log \frac{p}{k}$. This was expected (and is also tight), as even the norm induced by the gauge of the convex hull of $\mathcal{A}^0_{k,1}$ in the vector case, namely the $(k, 1)$-CUT norm, leads to the same order of magnitude $k \log \frac{p}{k}$: in contrast with the matrix case, the analogous tight relaxation in the vector case does not lead to an improvement in the statistical dimension rates. We point out that taking $k = \sqrt{m}$ as in the planted clique setting, we obtain a statistical dimension of $\sqrt{m} \log m$, which is small compared to the $O(m)$ rate obtained using the CUT polytope or the trace norm, or to the rates obtained using lifted trace norms [16]. Table 2 gathers the statistical dimensions of the different norms in the most favorable case of $\mathcal{A}^0_{k,q}$ atoms. A representation of the edges of the atom sets in the vector case can be found in Figure 1.
In that figure, the elements of A^0_{k,1} lie at the intersection of the three represented edges, and one clearly sees that the tangent cones coincide at these points, which gives an intuition for the matching orders of magnitude of the statistical dimensions. A last interesting remark is that all the vector norms have closed-form expressions in terms of the vector elements, whereas the matrix norms, with the exception of the element-wise ℓ1 norm, only have variational expressions. This is due to a basic fact: the elements of a vector can be sorted, but there is no analogous ordering of the elements of a matrix. Therefore, among the matrix norms, computing the dual norm in closed form is feasible only for the element-wise ℓ1 norm.

References

[1] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: a geometric theory of phase transitions in convex optimization. Submitted, 2013.
[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. NIPS, 2012.
[3] B. N. Bhaskar, G. Tang, and B. Recht. Atomic norm denoising with applications to line spectral estimation. Preprint, 2012.
[4] E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 2012.
[5] V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110, 2013.
[6] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12, 2012.
[7] A. d'Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9, 2008.
[8] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 2007.
[9] M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer, 1996.
[10] R. Foygel and L. Mackey. Corrupted sensing: novel guarantees for separating structured signals. Preprint, 2013.
[11] G. J. O. Jameson. Summing and Nuclear Norms in Banach Space Theory. Cambridge University Press, 1987.
[12] L. Mackey. Deflation methods for sparse PCA. NIPS, 2008.
[13] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 2010.
[14] S. Oymak and B. Hassibi. On a relation between the minimax denoising and the phase transitions of convex functions. In preparation, 2012.
[15] S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. Submitted, 2012.
[16] E. Richard, F. Bach, and J.-P. Vert. Intersecting singularities for multi-structured estimation. ICML, 2013.
[17] E. Richard, S. Gaiffas, and N. Vayatis. Link prediction in graphs with autoregressive features. NIPS, 2012.
[18] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[19] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[20] V. Vu, J. Cho, J. Lei, and K. Rohe. Fantope projection and selection: a near-optimal convex relaxation of sparse PCA. NIPS, 2013.
[21] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 2009.
[22] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 2006.