(k, q)-trace norm for sparse matrix factorization


Emile Richard (Department of Electrical Engineering, Stanford University), Guillaume Obozinski (Imagine, Ecole des Ponts ParisTech), Jean-Philippe Vert (CBIO, Mines ParisTech; Institut Curie; INSERM)

Abstract

We propose a new convex formulation for the sparse matrix factorization problem, in which the number of nonzero elements of the factors is fixed. An example is the problem of sparse principal component analysis, which we reformulate as a proximal operator computation through the definition of new matrix norms that can be used more generally as regularizers for various optimization problems over matrices. We estimate the Gaussian complexity of the suggested norms and of their vector analogues. Surprisingly, in the matrix case, knowing the block sizes in advance improves the statistical power of the norms by an order of magnitude compared to the usual $\ell_1$ and trace norms, whereas in the vector case the norms built using the knowledge of the number of nonzeros have the same complexity as the $\ell_1$ norm.

1 Introduction

A range of supervised problems, such as link prediction in graphs containing community structure [17], phase retrieval [4] or dictionary learning [13], amount to solving sparse factorization problems, i.e., to inferring a low-rank matrix that can be factorized as the product of two sparse matrices, with few columns (left factor) and few rows (right factor). Such a factorization allows more efficient storage, faster computation and more interpretable solutions, and above all it leads to more accurate estimates in many situations. In the case of interaction networks, for example, it is related to the assumption that the network is organized as a collection of highly connected communities which can overlap. More generally, considering sparse low-rank matrices combines two natural forms of sparsity, in the spectrum and in the support, which can be motivated by the need to explain the behavior of systems by a superposition of latent processes involving only a few parameters. Landmark applications of sparse matrix factorization are sparse principal component analysis [8, 22] and sparse canonical correlation analysis [21], which are widely used to analyze high-dimensional data such as genomic data.

We are motivated by the development of convex approaches to estimate sparse low-rank matrices. In order to properly define the matrix factorization, given sparsity levels of the factors denoted by integers $k$ and $q$, we first introduce below the (k, q)-rank of a matrix as the minimum number of left and right factors, having respectively $k$ and $q$ nonzeros, required to reconstruct a matrix. This index is a more involved complexity measure for matrices than the rank, in that it conditions on the number of nonzero elements of the left and right factors of the matrix. Using this index, we propose two new atomic norms for matrices [6], obtained as convex relaxations of the (k, q)-rank restricted to the unit balls of the operator norm and of the element-wise $\ell_\infty$ norm, respectively, resulting in convex surrogates for the low (k, q)-rank matrix estimation problem. We provide an equivalent formulation of the norms as nuclear norms in the sense of [11], highlighting in particular a link to the k-support norm of [2]. We analyze the Gaussian complexity of the new norms and compare them to other classical penalties. Our analysis shows that, while in the vector case the simple $\ell_1$ norm provides statistical power similar to the relaxation obtained by knowing in advance the number of nonzeros of the unknown, in the matrix case one can gain a factor by considering the more involved relaxations. However, while in the vector case the computation remains feasible in polynomial time, the norms we introduce for matrices cannot be evaluated in polynomial time. We propose algorithmic schemes to approximately learn with the new norms. The same norms and meta-algorithms can be used as regularizers in supervised problems such as multitask learning, quadratic regression and phase retrieval; this highlights the fact that our algorithmic contribution does not consist in providing more efficient solutions to the rank-1 SPCA problem, but in combining the atoms found by rank-1 solvers in a principled way. We finally show experiments on real data using bilinear regression, demonstrating the flexibility of our norm in formulating various matrix-valued problems.

1.1 Notations

Throughout the text, $m_1$, $m_2$, $k$ and $q$ are integers representing respectively the size $m_1 \times m_2$ of the matrices we consider, and the number of nonzero components in the left ($k$) and right ($q$) factors of a low-rank matrix. Given matrices $A$ and $B$ of the same size, $\langle A, B\rangle = \mathrm{tr}(A^\top B)$ is the standard inner product of matrices. For any matrix $Z \in \mathbb{R}^{m_1 \times m_2}$, the notations $\|Z\|_0$, $\|Z\|_1$, $\|Z\|_\infty$, $\|Z\|_*$, $\|Z\|_{\mathrm{op}}$ and $\mathrm{rank}(Z)$ stand respectively for the number of nonzeros, the entry-wise $\ell_1$ and $\ell_\infty$ norms, the trace norm (or nuclear norm, the sum of the singular values), the operator norm (the largest singular value) and the rank of $Z$. $\mathrm{supp}(Z) \subset \{1,\dots,m_1\} \times \{1,\dots,m_2\}$ is the set of indices of the nonzero elements of $Z$. When dealing with a matrix $Z$ whose nonzero elements form a block, $\mathrm{supp}(Z)$ takes the form $I \times J$ with $I \subset \{1,\dots,m_1\}$ and $J \subset \{1,\dots,m_2\}$. $\mathcal{G}_k$ (resp. $\mathcal{G}_q$) is the set of subsets of $k$ indices in $\{1,\dots,m_1\}$ (resp. $q$ indices in $\{1,\dots,m_2\}$). For a matrix $Z$ and two subsets of indices $I \subset \{1,\dots,m_1\}$ and $J \subset \{1,\dots,m_2\}$, $Z_{I,J}$ is the matrix having the same entries as $Z$ inside the index subset $I \times J$ and zero entries outside.

2 Tight convex relaxations of sparse factorization constraints

In this section we propose two new matrix norms that help define convex formulations of various sparse matrix factorization problems. We start by defining the (k, q)-rank of a matrix in Section 2.1, a useful generalization of the rank which also quantifies the sparseness of a matrix factorization.
We then introduce two atomic norms defined as tight convex relaxations of the (k, q)-rank: the (k, q)-trace norm (Section 2.2), obtained by relaxing the (k, q)-rank over the operator norm ball, and the (k, q)-CUT norm (also Section 2.2), obtained by relaxing the (k, q)-rank over the element-wise $\ell_\infty$ ball. We finally relate these matrix norms to vector norms using the concept of nuclear norms due to [11], linking in particular the (k, q)-trace norm for matrices to the k-support norm of [2].

2.1 The (k, q)-rank of a matrix

The rank of a matrix $Z$ is the minimum number of rank-1 matrices (i.e., outer products of vectors, of the form $ab^\top$) needed to express $Z$ as a linear combination $Z = \sum_{i=1}^r a_i b_i^\top$. It is a versatile concept, central to the definition of matrix factorization problems and low-rank approximations. The following definition generalizes this notion to incorporate conditions on the sparseness of the rank-1 elements:

Definition 1 ((k, q)-SVD and (k, q)-rank) For a matrix $Z \in \mathbb{R}^{m_1 \times m_2}$, we call a decomposition of the form $Z = \sum_{i=1}^r c_i a_i b_i^\top$, with $c_1 \geq c_2 \geq \dots \geq c_r > 0$, $a_i b_i^\top \in \mathcal{A}_{k,q}$ and minimal $r$, a (k, q)-sparse singular value decomposition of $Z$, or (k, q)-SVD. We refer to the vectors $a_i$ and $b_i$ as the left and right sparse singular vectors of $Z$, to the $c_i$ as its (k, q)-sparse singular values, and to $r$ as its (k, q)-rank.

The (k, q)-SVD is not necessarily unique and its factors are in general not orthogonal (see the appendix for counter-examples). The non-orthogonality of the factors is a major difference with the usual SVD, with important algorithmic consequences. It suggests that enforcing orthogonality between the sparse factors, as done in previous work [12, 20], would not be a good idea in our context. In particular, the example we study in the appendix does not have any sparse SVD with orthogonal factors. Note that by definition $\mathrm{rank}(Z) = (m_1, m_2)\text{-rank}(Z)$ and $\|Z\|_0 = (1, 1)\text{-rank}(Z)$, so that we have the tight inequalities
$$\mathrm{rank}(Z) \;\leq\; (k, q)\text{-rank}(Z) \;\leq\; \|Z\|_0.$$
The (k, q)-rank is useful to formalize problems such as sparse matrix factorization, which can be defined as approximating the solution of a matrix-valued problem by a matrix having low (k, q)-rank. For instance, the standard rank-1 SPCA problem consists in finding the matrix of $k$-rank equal to 1 that best approximates the sample covariance matrix [8]. Unfortunately, the (k, q)-rank is a nonconvex index, like the rank or the cardinality, leading to computational difficulties when one wants to estimate matrices with small (k, q)-rank. In the sequel, we therefore define and study convex relaxations of the (k, q)-rank that can replace it to quantify the sparseness and low-rankness of matrices.

2.2 Convex relaxations of the (k, q)-rank

In order to define convex relaxations of the (k, q)-rank, it is useful to recall the definition of atomic norms given by [6]:

Definition 2 (atomic norm) Given a centrally symmetric compact subset $\mathcal{A} \subset \mathbb{R}^p$ of elements called atoms, the atomic norm induced by $\mathcal{A}$ on $\mathbb{R}^p$ is the gauge function [19] of $\mathcal{A}$:
$$\|x\|_{\mathcal{A}} = \inf \{ t > 0 : x \in t\,\mathrm{conv}(\mathcal{A}) \}. \quad (1)$$
[6] show that the atomic norm is indeed a norm, which can be rewritten as
$$\|x\|_{\mathcal{A}} = \inf \Big\{ \sum_{a \in \mathcal{A}} c_a \;:\; x = \sum_{a \in \mathcal{A}} c_a a, \ c_a \geq 0 \ \forall a \in \mathcal{A} \Big\}, \quad (2)$$
and whose dual norm satisfies
$$\|x\|_{\mathcal{A}}^* = \sup \{ \langle x, z\rangle : \|z\|_{\mathcal{A}} \leq 1 \} = \sup \{ \langle x, a\rangle : a \in \mathcal{A} \}. \quad (3)$$
We can now propose our first convex relaxation of the (k, q)-rank as an atomic norm induced by a particular set of atoms.

Definition 3 ((k, q)-trace norm) The (k, q)-trace norm $\Omega_{k,q}(Z)$ is the atomic norm induced by the following set of atoms:
$$\mathcal{A}_{k,q} = \{ ab^\top : a \in \mathbb{R}^{m_1}, b \in \mathbb{R}^{m_2}, \|a\|_0 \leq k, \|b\|_0 \leq q, \|a\|_2 = \|b\|_2 = 1 \}. \quad (4)$$

In other words, $\mathcal{A}_{k,q}$ is the set of matrices $Z \in \mathbb{R}^{m_1 \times m_2}$ such that $(k, q)\text{-rank}(Z) = 1$ and $\|Z\|_{\mathrm{op}} = 1$; the (k, q)-trace norm is therefore the convex relaxation of the (k, q)-rank over the operator norm unit ball. Besides the characterization of $\Omega_{k,q}(Z)$ as a gauge function (1), the following lemma (whose proof is postponed to the appendix) provides an alternative characterization of $\Omega_{k,q}$ and its dual:

Lemma 1 For any $Z \in \mathbb{R}^{m_1 \times m_2}$ we have
$$\Omega_{k,q}(Z) = \inf \Big\{ \sum_{(I,J) \in \mathcal{G}_k \times \mathcal{G}_q} \|Z^{(I,J)}\|_* \;:\; Z = \sum_{(I,J)} Z^{(I,J)}, \ \mathrm{supp}(Z^{(I,J)}) \subset I \times J \Big\}, \quad (5)$$
and
$$\Omega_{k,q}^*(Z) = \max \{ \|Z_{I,J}\|_{\mathrm{op}} : I \in \mathcal{G}_k, J \in \mathcal{G}_q \}. \quad (6)$$

Our second norm is obtained by focusing on a more restricted set of atoms, motivated by applications where, in addition to being sparse and low-rank, we want to estimate matrices which are constant over blocks:

Definition 4 ((k, q)-CUT norm) The (k, q)-CUT norm $\|Z\|_{(k,q)\text{-CUT}}$ is the atomic norm induced by the following set of atoms:
$$\mathcal{A}^0_{k,q} = \Big\{ ab^\top \;:\; a \in \mathbb{R}^{m_1}, \|a\|_0 = k, |a_i| = \tfrac{1}{\sqrt{k}} \ \forall i \in \mathrm{supp}(a); \ b \in \mathbb{R}^{m_2}, \|b\|_0 = q, |b_j| = \tfrac{1}{\sqrt{q}} \ \forall j \in \mathrm{supp}(b) \Big\}. \quad (7)$$

In other words, the atoms of $\mathcal{A}^0_{k,q}$ are the atoms of $\mathcal{A}_{k,q}$ whose nonzero elements all have the same amplitude; the (k, q)-CUT norm is therefore the convex relaxation of the (k, q)-rank over the element-wise $\ell_\infty$ norm unit ball. Its name was chosen by analogy with the CUT polytope of [9].

2.3 Equivalent nuclear norms built upon vector norms

Let us now consider a different approach to defining convex matrix norms, by building nuclear norms for matrices [11]. These constructions can be seen as a key to writing alternating minimization algorithms. For that purpose it is useful to recall the following general definition of nuclear norms and the characterization of the corresponding dual norms, taken from [11]:

Proposition 1 (nuclear norm) Let $\alpha$ and $\beta$ denote any vector norms on $\mathbb{R}^{m_1}$ and $\mathbb{R}^{m_2}$, respectively. Then
$$\nu(Z) = \inf \Big\{ \sum_i \|a_i\|_\alpha \|b_i\|_\beta \;:\; Z = \sum_i a_i b_i^\top \Big\},$$
where the infimum is taken over all summations of finite length, is a norm over $\mathbb{R}^{m_1 \times m_2}$ called the nuclear norm induced by $\alpha$ and $\beta$. Its dual is given by
$$\nu^*(Z) = \sup \{ a^\top Z b : \|a\|_\alpha \leq 1, \ \|b\|_\beta \leq 1 \}. \quad (8)$$
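Expression (6) in Lemma 1 makes the combinatorial nature of the norm explicit: the dual norm is the largest spectral norm over all $k \times q$ submatrices, so an exact evaluation enumerates all $\binom{m_1}{k}\binom{m_2}{q}$ blocks. The following sketch (ours, not code from the paper; it assumes small dimensions and uses NumPy) illustrates that brute-force computation of $\Omega_{k,q}^*$.

```python
import itertools
import numpy as np

def dual_kq_trace_norm(Z, k, q):
    """Brute-force evaluation of the dual (k,q)-trace norm of Eq. (6):
    the maximum operator norm over all k x q submatrices of Z.
    Exponential in m1, m2 -- only meant for small illustrative examples."""
    m1, m2 = Z.shape
    best = 0.0
    for I in itertools.combinations(range(m1), k):
        for J in itertools.combinations(range(m2), q):
            # operator norm (largest singular value) of the (I, J) block
            best = max(best, np.linalg.norm(Z[np.ix_(I, J)], 2))
    return best

# Example: the dual norm of an atom ab^T (k-sparse a, q-sparse b, unit norm) is 1.
rng = np.random.default_rng(0)
a = np.zeros(8); a[:3] = rng.normal(size=3); a /= np.linalg.norm(a)
b = np.zeros(10); b[:2] = rng.normal(size=2); b /= np.linalg.norm(b)
print(dual_kq_trace_norm(np.outer(a, b), k=3, q=2))  # 1.0 up to floating point
```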

Figure 1: Edges of the unit balls of three norms of interest (the $\ell_1$ norm, the k-support norm $\theta_k$ and $\kappa_k$) in the vector case $p = 3$, where the number of nonzero elements is fixed to $k$. The vertices of the (k, 1)-CUT polytope constitute the set $\mathcal{A}^0_{k,1}$.

The following result shows that the nuclear norm induced by two atomic norms (Definition 2) is itself an atomic norm. The proof is straightforward: it amounts to writing the dual norm thanks to (8) and observing that it equals the dual of the corresponding atomic norm using expression (3).

Lemma 2 If $\alpha$ and $\beta$ are two atomic norms on $\mathbb{R}^{m_1}$ and $\mathbb{R}^{m_2}$ induced respectively by two atom sets $\mathcal{A}_1$ and $\mathcal{A}_2$, then the nuclear norm on $\mathbb{R}^{m_1 \times m_2}$ induced by $\alpha$ and $\beta$ is the atomic norm induced by the atom set $\mathcal{A} = \{ ab^\top : a \in \mathcal{A}_1, b \in \mathcal{A}_2 \}$.

For instance, if $\alpha$ and $\beta$ are taken to be the $\ell_1$ norm for vectors, which is the atomic norm induced by the unit-norm one-sparse vectors $\{\pm e_i\}$, the resulting nuclear norm is simply the element-wise $\ell_1$ norm of the matrix, induced by the atoms $\{\pm e_i e_j^\top\}$. We now show that the (k, q)-trace norm and the (k, q)-CUT norm are both nuclear norms, induced by specific vector norms:

Theorem 1 1) The (k, q)-trace norm is the nuclear norm induced by $\theta_k$ and $\theta_q$, the k- and q-support norms of [2], given by
$$\theta_k(w) = \Bigg( \sum_{i=1}^{k-r-1} |w|_{(i)}^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{p} |w|_{(i)} \Big)^2 \Bigg)^{1/2}, \quad (9)$$
where $|w|_{(i)}$ denotes the $i$th largest entry of $w$ in absolute value, $r \in \{0, \dots, k-1\}$ is the unique integer such that $|w|_{(k-r-1)} > \frac{1}{r+1} \sum_{i=k-r}^{p} |w|_{(i)} \geq |w|_{(k-r)}$, and where by convention $|w|_{(0)} = +\infty$.
2) The (k, q)-CUT norm is the nuclear norm induced by the vector norms $\kappa_k(\cdot)$ and $\kappa_q(\cdot)$ defined by
$$\kappa_k(w) = \max\Big( \sqrt{k}\, \|w\|_\infty, \ \tfrac{1}{\sqrt{k}} \|w\|_1 \Big). \quad (10)$$
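For intuition, the two vector norms of Theorem 1 are cheap to evaluate from their closed forms, in contrast with the matrix norms they induce. The small sketch below (ours, plain NumPy) evaluates $\theta_k$ via formula (9), searching for the index $r$, together with $\kappa_k$ via formula (10).

```python
import numpy as np

def k_support_norm(w, k):
    """k-support norm theta_k(w), Eq. (9): find the unique r in {0,...,k-1} with
    |w|_(k-r-1) > (1/(r+1)) * sum_{i=k-r}^{p} |w|_(i) >= |w|_(k-r),
    using the convention |w|_(0) = +inf (indices are 1-based in the formula)."""
    u = np.sort(np.abs(w))[::-1]            # |w|_(1) >= |w|_(2) >= ...
    for r in range(k):
        tail = u[k - r - 1:].sum()          # sum_{i = k-r}^{p} |w|_(i)
        head = np.inf if k - r - 1 == 0 else u[k - r - 2]  # |w|_(k-r-1)
        if head > tail / (r + 1) >= u[k - r - 1]:
            top = u[:k - r - 1]             # |w|_(1), ..., |w|_(k-r-1)
            return np.sqrt(np.sum(top ** 2) + tail ** 2 / (r + 1))
    raise RuntimeError("no valid r found; check the input")

def kappa_k(w, k):
    """kappa_k(w), Eq. (10): gauge of the k-sparse, equal-magnitude atoms."""
    return max(np.sqrt(k) * np.abs(w).max(), np.abs(w).sum() / np.sqrt(k))

w = np.array([3.0, -1.0, 0.5, 0.0])
print(k_support_norm(w, k=2), kappa_k(w, k=2))  # theta_2 lies between ||w||_2 and ||w||_1
```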

2.4 Subdifferential of the (k, q)-trace norm at atoms

The following lemma provides an explicit description of the subdifferential of $\Omega_{k,q}$ at an atom $A = ab^\top \in \mathcal{A}_{k,q}$, which will be useful to study the statistical properties of this norm as a regularizer in the next section.

Lemma 3 Let $I_0 = \mathrm{supp}(a)$ and $J_0 = \mathrm{supp}(b)$. The subdifferential of $\Omega_{k,q}$ at $A \in \mathcal{A}_{k,q}$ is
$$\partial \Omega_{k,q}(A) = \big\{ A + Z \;:\; A Z_{I_0,J_0}^\top = 0, \ A^\top Z_{I_0,J_0} = 0, \ \forall (I, J) \in \mathcal{G}_k \times \mathcal{G}_q, \ \|A_{I,J} + Z_{I,J}\|_{\mathrm{op}} \leq 1 \big\}. \quad (11)$$

The characterization of the subdifferential of the norm at a point is important to understand how easily the point can be recovered when the norm is used as a penalty. A geometric intuition for this fact can be obtained by looking at the normal cone, the conic hull of the subdifferential: the larger the normal cone at a point, the more tractable the estimation of that point in statistical terms when the norm is used as a convex penalty. Comparing the dimension of the subspace in which $Z$ lives in Lemma 3, namely $m_1 m_2 - k - q + 1$, with the analogous subspace for the trace norm, $\{Z : AZ^\top = 0, A^\top Z = 0\}$, which has dimension $m_1 m_2 - m_1 - m_2 + 1$, and with the one for the $\ell_1$ norm (matrices with disjoint support), which has dimension $m_1 m_2 - kq$, one expects this norm to have a much larger subdifferential, resulting in better statistical performance. In Section 3 we give quantitative support to this observation.

2.5 Positive semidefinite matrices

The case of the decomposition of a symmetric positive semidefinite matrix $\Sigma$ deserves a specific discussion. It might indeed be desirable to obtain a decomposition of the form $\Sigma = \sum_{i=1}^r c_i a_i a_i^\top$, with the $c_i$ possibly constrained in addition to be non-negative. This is easily obtained by replacing the set of atoms by $\mathcal{A}_{k,+} = \{aa^\top : a \in \mathcal{A}_k\}$ or $\mathcal{A}_{k,\mathrm{sym}} = \{\pm aa^\top : a \in \mathcal{A}_k\}$ and constructing the corresponding atomic norms, which are very similar to the ones we presented. To be accurate, it should be noted that the set $\mathcal{A}_{k,+}$ is not centrally symmetric and does not induce a norm but simply a gauge; this does not change the claims of this work, which only rely on the convexity of the gauge. When using the corresponding gauges as regularizers, the gauge of $\mathcal{A}_{k,+}$ has the advantage of guaranteeing that the solution of the problem is p.s.d., which might not be the case for the other atomic norms.

2.6 Using $\Omega_{k,q}$ as a regularizer

We suggest using the (k, q)-trace norm as a regularizer in a number of applications. This means that once the matrix estimation problem is formulated as the minimization of a convex loss $\ell : \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}$, we consider solving the penalized problem of minimizing $\ell(Z) + \lambda \Omega_{k,q}(Z)$ for a parameter $\lambda > 0$. The simplest loss to consider is the squared Euclidean distance to an observation $Y$, $\ell(Z) = \tfrac12\|Z - Y\|_F^2$, which can be used for denoising an observation $Y = Z^\star + G$, where $G$ stands for the noise and $Z^\star$ for the matrix we want to estimate. An important instance of such a problem is sparse PCA, where, based on the sample covariance matrix $\hat\Sigma_n$, one aims at solving successively rank-1 problems of finding the leading sparse eigenvector,
$$\max_z \ z^\top \hat\Sigma_n z \quad \text{s.t.} \quad \|z\|_2 = 1, \ \|z\|_0 \leq k, \qquad \text{(rank-1 SPCA)}$$
and switching to the next component using some heuristic [12]. Rather than solving a sequence of such problems, we formulate the problem of estimating $r$ principal components, each having $k$ nonzero elements, as
$$\min \Big\{ \|\hat\Sigma_n - Z\|_F^2 \;:\; k\text{-rank}(Z) \leq r \ \text{and} \ Z \succeq 0 \Big\}, \quad (12)$$
and relax the nonconvex problem (12) to the following convex optimization problem, which can be interpreted as projecting onto a portion of the unit ball of $\Omega_k$ inside the PSD cone, or as computing the proximal map of $\Omega_k$:
$$\min \Big\{ \tfrac12\|\hat\Sigma_n - Z\|_F^2 + \lambda \Omega_k(Z) \;:\; Z \succeq 0 \Big\}.$$
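Since $\Omega_k$ and $\Omega_{k,q}$ cannot be evaluated in polynomial time, practical schemes build a solution by combining atoms returned by a rank-1 SPCA oracle, as mentioned in the introduction. The sketch below shows one common heuristic for such an oracle, a truncated power iteration; it is our illustrative choice of solver under that assumption, not necessarily the rank-1 solver used by the authors.

```python
import numpy as np

def truncated_power_iteration(sigma, k, n_iter=200, seed=0):
    """Heuristic for rank-1 SPCA: max z' Sigma z  s.t. ||z||_2 = 1, ||z||_0 <= k.
    Alternates a power step with hard-thresholding to the k largest entries,
    returning a k-sparse unit vector (an element a of A_k)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=sigma.shape[0])
    z /= np.linalg.norm(z)
    for _ in range(n_iter):
        z = sigma @ z
        keep = np.argsort(np.abs(z))[-k:]        # indices of the k largest entries
        mask = np.zeros_like(z); mask[keep] = 1.0
        z = z * mask
        z /= np.linalg.norm(z)
    return z

# Toy example: covariance with a planted 3-sparse leading direction.
a = np.zeros(20); a[:3] = 1 / np.sqrt(3)
sigma_hat = 5.0 * np.outer(a, a) + 0.1 * np.eye(20)
print(np.nonzero(truncated_power_iteration(sigma_hat, k=3))[0])  # expected: [0 1 2]
```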

Another example of a loss function comes from linear regression, where an observation $y = \mathcal{X}(Z^\star) + \epsilon$ is available, with $\mathcal{X} : \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ linear and $\epsilon$ modeling the noise process. We may use the least-squares loss $\ell(Z) = \|\mathcal{X}(Z) - y\|_2^2$ to estimate $Z^\star$. Bilinear regression is an example where the linear map $\mathcal{X}$ is given by $\mathcal{X}(Z) = \mathrm{vec}(X_1 Z X_2^\top)$, with $X_1 \in \mathbb{R}^{n_1 \times m_1}$ and $X_2 \in \mathbb{R}^{n_2 \times m_2}$ design matrices. A particular instance of bilinear regression is quadratic regression, which under the PSD constraint $Z \succeq 0$ is closely related to phase retrieval [4]. In this setup, for a design matrix $X \in \mathbb{R}^{n \times p}$, we define $\mathcal{X}(Z) = \mathrm{diag}(X Z X^\top)$. We expect $Z^\star$ to be sparse, motivated by the small number of active variables in the data, and we also assume that the organization of the feature interactions into an even smaller number of latent groups makes the true parameter $Z^\star$ low-rank.

3 Statistical properties of the suggested regularizers

As discussed above, the norms we introduce are designed to be used as regularizers in various supervised and unsupervised problems. We briefly recall a set of relevant statistical guarantees that can be stated based on computing the expected dual norm of a noise process and the statistical dimension of the regularizer at particular points. In Section 3.1 we study the denoising problem and compare the mean squared errors obtained with different regularizers when denoising a single spike corrupted with additive Gaussian noise. In Section 3.2 we focus on computing the statistical dimension of the objects of interest, both in the matrix and in the vector case, and compare the asymptotic rates, which sheds light on the power of the norms we study when used as convex penalties.

3.1 Minimum square error in denoising

Knowing the expected value of the dual norm of the noise process is informative for denoising. Consider the denoising problem where $Y = Z^\star + G$ is observed and we aim at estimating $Z^\star$, knowing that it has low (k, q)-rank. Following Theorem 1 of [3], we know that when the parameter $\lambda$ is chosen large enough, $\lambda \geq \mathbb{E}\, \Omega_{k,q}^*(G)$, the proximal (soft-thresholding) operator associated with $\Omega_{k,q}$, namely
$$\hat Z = \arg\min_{Z \in \mathbb{R}^{m_1 \times m_2}} \tfrac12\|Z - Y\|_F^2 + \lambda \Omega_{k,q}(Z),$$
satisfies $\mathbb{E}\, \|\hat Z - Z^\star\|_F^2 \leq \lambda\, \Omega_{k,q}(Z^\star)$. For instance, if the model is $Y = Z^\star + G$ with $G \in \mathbb{R}^{m_1 \times m_2}$ having i.i.d. Gaussian entries drawn from $\mathcal{N}(0, \sigma^2)$, we can upper bound the expected value of the dual norm $\mathbb{E}\, \Omega_{k,q}^*(G)$.

Lemma 4 Taking the expectation of the dual norm over matrices $G \in \mathbb{R}^{m_1 \times m_2}$ with entries drawn i.i.d. from $\mathcal{N}(0, 1)$, we have
$$\mathbb{E}\, \Omega_{k,q}^*(G) \;\leq\; \sqrt{ k \log\tfrac{m_1}{k} + \tfrac{k}{4} + q \log\tfrac{m_2}{q} + \tfrac{q}{4} }. \quad (13)$$
The classical results for the operator and element-wise $\ell_\infty$ norms are quantities of a similar nature: $\mathbb{E}\, \|G\|_{\mathrm{op}} \leq \sqrt{m_1} + \sqrt{m_2}$ and $\mathbb{E}\, \|G\|_\infty \leq \sqrt{2\log(m_1 m_2)}$.
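At toy sizes, the three expected dual norms appearing in Lemma 4 and in the classical comparisons can be estimated by Monte Carlo, reusing the brute-force dual_kq_trace_norm sketch above. This is our own illustration (not an experiment from the paper) of the quantities that set the regularization levels used below.

```python
import numpy as np
# reuses dual_kq_trace_norm defined in the earlier sketch

m1, m2, k, q = 8, 10, 3, 2
rng = np.random.default_rng(0)
G = [rng.normal(size=(m1, m2)) for _ in range(200)]

print("E Omega*_{k,q}(G) ~", np.mean([dual_kq_trace_norm(g, k, q) for g in G]))
print("E ||G||_op        ~", np.mean([np.linalg.norm(g, 2) for g in G]))
print("E ||G||_inf       ~", np.mean([np.abs(g).max() for g in G]))
```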

As a consequence of Lemma 4, under centered Gaussian noise with variance $\sigma^2$, we have the following.

Theorem 2 Let $G \in \mathbb{R}^{m_1 \times m_2}$ have entries drawn i.i.d. from $\mathcal{N}(0, 1)$ and let $Y = Z^\star + \sigma G$ be a noisy observation. The estimator $\hat Z = \arg\min_Z \tfrac12 \|Z - Y\|_F^2 + \lambda\, \Omega_{k,q}(Z)$, where $\lambda = \sigma \sqrt{ k \log\tfrac{m_1}{k} + k/4 + q \log\tfrac{m_2}{q} + q/4 }$, satisfies
$$\mathbb{E}\, \|\hat Z - Z^\star\|_F^2 \;\leq\; \sigma\, \Omega_{k,q}(Z^\star) \sqrt{ k \log\tfrac{m_1}{k} + k/4 + q \log\tfrac{m_2}{q} + q/4 }.$$
Similar estimators built using the $\ell_1$ and trace norms, namely $\tilde Z = \arg\min_Z \tfrac12\|Z - Y\|_F^2 + \lambda_1 \|Z\|_1$ and $\bar Z = \arg\min_Z \tfrac12\|Z - Y\|_F^2 + \lambda_* \|Z\|_*$ with $\lambda_1 = \sigma\sqrt{2\log(m_1 m_2)}$ and $\lambda_* = \sigma(\sqrt{m_1} + \sqrt{m_2})$, satisfy
$$\mathbb{E}\, \|\tilde Z - Z^\star\|_F^2 \leq \sigma \|Z^\star\|_1 \sqrt{2\log(m_1 m_2)} \quad \text{and} \quad \mathbb{E}\, \|\bar Z - Z^\star\|_F^2 \leq \sigma \|Z^\star\|_* (\sqrt{m_1} + \sqrt{m_2}).$$

A fundamental example is the so-called single-spike model, where the observation consists of an atom with entries of equal magnitude, $ab^\top \in \mathcal{A}^0_{k,q}$, corrupted with additive noise: $Y = ab^\top + \sigma G$. In this situation one may want to compare denoising with $\Omega_{k,q}$ to denoising with the trace norm or the $\ell_1$ norm. Since $\|ab^\top\|_1 = kq \cdot \tfrac{1}{\sqrt{kq}} = \sqrt{kq}$ and $\Omega_{k,q}(ab^\top) = \|ab^\top\|_* = 1$, we obtain the following.

Corollary 1 When $Z^\star = ab^\top \in \mathcal{A}^0_{k,q}$ is an atom, the estimators defined in Theorem 2 satisfy
$$\mathbb{E}\, \|\hat Z - ab^\top\|_F^2 \leq \sigma \sqrt{ k \log\tfrac{m_1}{k} + k/4 + q \log\tfrac{m_2}{q} + q/4 },$$
$$\mathbb{E}\, \|\tilde Z - ab^\top\|_F^2 \leq \sigma \sqrt{2 kq \log(m_1 m_2)} \quad \text{and} \quad \mathbb{E}\, \|\bar Z - ab^\top\|_F^2 \leq \sigma (\sqrt{m_1} + \sqrt{m_2}).$$

This is a first observation exhibiting the power of this regularizer; the relevant orders of magnitude are gathered in Table 1 to make the comparison easy. Note that when $ab^\top \in \mathcal{A}_{k,q}$ moves away from the elements of $\mathcal{A}^0_{k,q}$, for instance when it gets close to $e_1 e_1^\top$, the bound gets tighter for the $\ell_1$ norm, reaching $\sigma\sqrt{2\log(m_1 m_2)}$ at $e_1 e_1^\top$, while it does not change for the two other norms.

Table 1: Minimum square error achieved by various norms when denoising an atom $ab^\top \in \mathcal{A}^0_{k,q}$ corrupted with unit-variance Gaussian noise. The column $k = \sqrt{m}$ gives the orders of magnitude in the regime $m_1 = m_2 = m$ and $k = q = \sqrt{m}$.

Matrix norm | Minimum square error | $k = \sqrt{m}$
(k, q)-trace | $\sqrt{k \log\frac{m_1}{k} + q \log\frac{m_2}{q}}$ | $m^{1/4} \sqrt{\log m}$
trace | $\sqrt{m_1} + \sqrt{m_2}$ | $\sqrt{m}$
$\ell_1$ | $\sqrt{kq \log(m_1 m_2)}$ | $\sqrt{m \log m}$

3.2 The statistical dimension of a convex function at a given point

Interesting tools have recently been used by [6, 14, 1, 10] to quantify the statistical power of a convex nonsmooth regularizer used as a constraint or a penalty. The Gaussian complexity term has a nice geometric interpretation: it quantifies the width of the cone of descent directions locally. Interestingly, this quantity is related to sample complexity terms for compressed sensing, signal denoising and demixing applications. For conciseness, we call statistical dimension of a function at a point the quantity that [1] call the statistical dimension of the tangent cone associated with the function at that point.

This quantity equals, up to an additive constant, the squared Gaussian width of the tangent cone intersected with the unit Euclidean sphere used by [6]; see [1] for technical details. To give a concise definition of what we refer to as the statistical dimension, let, for a convex function $f$, $N_f(w)$ denote the normal cone of $f$ at the point $w$, that is, the conic hull of the subdifferential. Equivalently, the normal cone is the polar cone of the set of descent directions, also called the tangent cone. We denote the statistical dimension of a convex function $f : \mathbb{R}^p \to \mathbb{R}$ at $w \in \mathbb{R}^p$ by
$$\mathcal{S}(w, f) = \mathbb{E}\, \mathrm{dist}\big(g, N_f(w)\big)^2,$$
where $\mathrm{dist}$ denotes the Euclidean distance of the Gaussian vector $g \sim \mathcal{N}(0, I_p)$ to the normal cone $N_f(w)$ of $f$ at $w$. To motivate our particular focus on computing this specific complexity term, we recall a non-exhaustive list of results formulated using the statistical dimension.

Exact recovery. Having observed $y = Xw^\star$ with $X \in \mathbb{R}^{n \times p}$ having i.i.d. entries drawn from $\mathcal{N}(0, 1/n)$, the solution to
$$\min_w f(w) \quad \text{s.t.} \quad Xw = y$$
equals $w^\star$ with overwhelming probability as soon as $n \geq \mathcal{S}(w^\star, f)$, see [6, Corollary 3.3]. In addition, [1] proved (Theorem II) that the phase transition is centered at $\mathcal{S}(w^\star, f)$, with a width not exceeding $\sqrt{p}$ in magnitude.

Robust recovery. Given a corrupted observation $y = Xw^\star + \epsilon$, where the corruption $\epsilon \in \mathbb{R}^n$ is assumed to be bounded, $\|\epsilon\|_2 \leq \delta$, then
$$\hat w = \arg\min_w f(w) \quad \text{s.t.} \quad \|Xw - y\|_2 \leq \delta$$
satisfies $\|\hat w - w^\star\|_2 \lesssim \delta/\eta$ with overwhelming probability as soon as $n \geq \big(\sqrt{\mathcal{S}(w^\star, f)} + 3/2\big)^2/(1-\eta)^2$, see [6, Corollary 3.3].

Denoising. Assume a collection of noisy observations $x_i = w^\star + \sigma \epsilon_i$, $i = 1, \dots, n$, is available, where the $\epsilon_i \sim \mathcal{N}(0, I_p)$ are i.i.d., and let $y = \tfrac1n \sum_{i=1}^n x_i$ denote their average. [5] prove in Proposition 4 that
$$\hat w = \arg\min_w \|w - y\|_2 \quad \text{s.t.} \quad f(w) \leq f(w^\star)$$
satisfies $\mathbb{E}\, \|\hat w - w^\star\|_2^2 \leq \tfrac{\sigma^2}{n}\, \mathcal{S}(w^\star, f)$.

Demixing. Given $y = Uw^\star + v^\star$ with $U \in \mathbb{R}^{p \times p}$ orthogonal and $w^\star, v^\star \in \mathbb{R}^p$, [1] proved in Theorem III that the pair $(\hat w, \hat v)$ of solutions of
$$\min_{w, v} f(w) \quad \text{s.t.} \quad g(v) \leq g(v^\star) \ \text{and} \ y = Uw + v$$
equals $(w^\star, v^\star)$ with probability $1 - \eta$ provided that $\mathcal{S}(w^\star, f) + \mathcal{S}(v^\star, g) \leq p - 4\sqrt{p \log\tfrac{4}{\eta}}$. Conversely, if $\mathcal{S}(w^\star, f) + \mathcal{S}(v^\star, g) \geq p + 4\sqrt{p \log\tfrac{4}{\eta}}$, the demixing succeeds with probability at most $\eta$.
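To make $\mathcal{S}(w, f)$ concrete, the sketch below (our illustration, not code from the paper) estimates it by Monte Carlo for $f = \|\cdot\|_1$ at a k-sparse point. It uses the fact, consistent with the definition above, that the normal cone of the $\ell_1$ norm is the conic hull of its subdifferential, so the squared distance to it reduces to a one-dimensional convex minimization over the scaling $t \geq 0$ of the subgradients.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def stat_dim_l1(w, n_samples=1000, seed=0):
    """Monte Carlo estimate of S(w, ||.||_1) = E dist(g, N_f(w))^2.
    For the l1 norm, N_f(w) = {z : z_i = t*sign(w_i) on the support, |z_i| <= t off it, t >= 0},
    so the squared distance is minimized over the single scalar t."""
    rng = np.random.default_rng(seed)
    s = w != 0
    sgn = np.sign(w[s])
    total = 0.0
    for _ in range(n_samples):
        g = rng.normal(size=w.shape)
        def sq_dist(t):
            on = np.sum((g[s] - t * sgn) ** 2)                   # support coordinates
            off = np.sum(np.maximum(np.abs(g[~s]) - t, 0.0) ** 2)  # off-support coordinates
            return on + off
        res = minimize_scalar(sq_dist, bounds=(0.0, np.abs(g).max() + 5.0), method="bounded")
        total += res.fun
    return total / n_samples

p, k = 200, 5
w = np.zeros(p); w[:k] = 1.0
print("estimated S(w, l1)      :", stat_dim_l1(w))
print("order k*log(p/k), cf. Table 2:", k * np.log(p / k))
```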

3.3 Statistical dimensions of the suggested regularizers

We will focus on computing the statistical dimension of the suggested norms at atoms. This has two principal motivations. The first is to derive a number of potential corollaries of our results for potential applications of this regularizer. The second is the comparison of the statistical dimension of the suggested norm to those of related functions. The comparison provides a better understanding of the geometry of the problem and locates it next to better-known objects such as the trace norm or the CUT polytope. The analogy with the vector case is particularly surprising: the knowledge of the number of nonzeros does not allow one to gain an order of magnitude in the vector case, whereas the gain is significant in the matrix case; see Section 3.4.

Let us start with a simple result showing that the (k, q)-trace norm is at least as statistically powerful as any convex combination of the $\ell_1$ and trace norms. By restriction to the PSD cone, this result shows the statistical superiority of the suggested norm over the more standard SDP relaxations of the sparse PCA problem [7].

Lemma 5 Let $\mathcal{A} \subset \mathcal{B}$ and let $\|\cdot\|_{\mathcal{A}}$ and $\|\cdot\|_{\mathcal{B}}$ denote the gauges associated with the convex hulls of the corresponding sets. For any $z \in \mathcal{A}$, we have:
1. the inclusion of normal cones $N_{\mathcal{B}}(z) \subset N_{\mathcal{A}}(z)$;
2. for any $g \in \mathbb{R}^p$, $\mathrm{dist}(g, N_{\mathcal{A}}(z)) \leq \mathrm{dist}(g, N_{\mathcal{B}}(z))$;
3. $\mathcal{S}(z, \|\cdot\|_{\mathcal{A}}) \leq \mathcal{S}(z, \|\cdot\|_{\mathcal{B}})$.

The following result shows that $\Omega_{k,q}$ is, on $\mathcal{A}^0_{k,q}$, always at least as good as the $\ell_1$ + trace norm penalty, which is popular for estimating sparse low-rank matrices [18].

Proposition 2 Fix a pair of real numbers $\lambda, \nu > 0$ and let $f$ denote the function $Z \mapsto \lambda \|Z\|_1 + \nu \|Z\|_*$. The normal cone of $\Omega_{k,q}$ at $A = ab^\top \in \mathcal{A}^0_{k,q}$ contains the normal cone of $f$ and is contained in the normal cone of the (k, q)-CUT norm at $A$:
$$N_f(A) \;\subset\; N_{\Omega_{k,q}}(A) \;\subset\; N_{(k,q)\text{-CUT}}(A).$$
The proof can be found in the appendix. We recall the following lower bound, of the order of $kq \wedge (m_1 + m_2)$, on the statistical dimension of the sum of the $\ell_1$ and trace norms, before stating upper bounds on the statistical dimensions of the (k, q)-trace and (k, q)-CUT norms, which improve on it by achieving asymptotic rates of the order of $k \log m_1 + q \log m_2$.

Proposition 3 ([15], Theorem 3.3) Let $\lambda, \mu > 0$ and consider $\nu : Z \mapsto \lambda\|Z\|_1 + \mu\|Z\|_*$. Then for every $A \in \mathcal{A}_{k,q}$ there exists a constant $C > 0$ such that the upper bound $\mathcal{S}(A, \nu) \leq C\, \big(kq \wedge (m_1 + m_2)\big)$ holds and cannot be improved in terms of its scaling in $k$, $q$, $m_1$, $m_2$.

Before turning to more advanced results, let us introduce the following coefficient, which quantifies how close a point $ab^\top \in \mathcal{A}_{k,q}$ is to the subset $\mathcal{A}^0_{k,q}$.

Definition 5 (Dispersion coefficient) Let $A = ab^\top \in \mathcal{A}_{k,q}$, and let $I_0 = \mathrm{supp}(a)$ and $J_0 = \mathrm{supp}(b)$. We define the dispersion coefficient $\gamma(a, b) \in (0, \tfrac14]$ as
$$\gamma(a, b) = \tfrac14 \min\Big( k \min_{i \in I_0} a_i^2, \ q \min_{j \in J_0} b_j^2 \Big),$$
the maximal value $\tfrac14$ being reached for vectors $a$ and $b$ whose nonzero entries all have the same amplitude, $\tfrac{1}{\sqrt{k}}$ and $\tfrac{1}{\sqrt{q}}$ respectively.
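The dispersion coefficient of Definition 5 is straightforward to evaluate; the small sketch below (ours) checks that atoms of $\mathcal{A}^0_{k,q}$ reach the maximal value $\tfrac14$ while a skewed atom falls strictly below it.

```python
import numpy as np

def dispersion(a, b):
    """Dispersion coefficient gamma(a, b) of Definition 5 for an atom a b^T in A_{k,q}."""
    a_nz, b_nz = a[a != 0], b[b != 0]
    k, q = len(a_nz), len(b_nz)
    return 0.25 * min(k * np.min(a_nz ** 2), q * np.min(b_nz ** 2))

k, q = 4, 3
a0 = np.concatenate([np.full(k, 1 / np.sqrt(k)), np.zeros(6)])   # equal-magnitude, unit norm
b0 = np.concatenate([np.full(q, 1 / np.sqrt(q)), np.zeros(5)])
a1 = np.concatenate([np.array([0.8, 0.4, 0.4, 0.2]), np.zeros(6)])  # skewed, still unit norm
print(dispersion(a0, b0))   # 0.25, the maximal value
print(dispersion(a1, b0))   # strictly smaller
```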

Proposition 4 Let $A = ab^\top \in \mathcal{A}_{k,q}$, and let $\gamma = \gamma(a, b)$ denote its dispersion coefficient. The statistical dimension of $\Omega_{k,q}$ at $A$ is bounded by
$$\mathcal{S}(A, \Omega_{k,q}) \;\leq\; \frac{6(k + q)}{\gamma} + \frac{1}{\gamma}\Big( k \log\frac{m_1}{\gamma k} + q \log\frac{m_2}{\gamma q} \Big).$$

Corollary 2 In case $A \in \mathcal{A}^0_{k,q}$, we get $\gamma = \tfrac14$, so the statistical dimension is bounded by a term of the order of $k \log m_1 + q \log m_2$, which fits the rates obtained for the (k, q)-CUT norm (see Proposition 6) up to a term $k \log k + q \log q \leq k \log m_1 + q \log m_2$.

In the vector case, thanks to the closed-form expression (9) of the k-support norm, we can obtain the following bound, which has the same asymptotic rate, with smaller constants, as the upper bound obtained by simply setting $m_2 = q = 1$ in Proposition 4.

Proposition 5 The statistical dimension of the k-support norm at an s-sparse vector $w \in \mathbb{R}^p$ is bounded by
$$\mathcal{S}(w, \theta_k) \;\leq\; \min\left\{ \frac54\, k + \left(\frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s}|w|_{(i)}}\right)^{2} \log\frac{p}{k} \;,\;\; \frac53 \left(\frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s}|w|_{(i)}}\right)^{2/3} (p-k)^{1/3}\, p^{1/6} \right\},$$
where $|w|_{(i)}$ denotes the $i$-th entry of $w$ sorted in descending order of absolute value.

The term $\frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s}|w|_{(i)}}$ is the vector analogue of the dispersion coefficient $\gamma(a, b)$ (see Definition 5). We distinguish four regimes:

1. In case $w \in \mathcal{A}_{k,1} \setminus \mathcal{A}^0_{k,1}$ is k-sparse with entries not all equal in absolute value, $r = 0$ and $\theta_k(w) = 1$. Therefore $\frac{(r+1)\theta_k(w)}{\sum_{i=k-r}^{s}|w|_{(i)}} = \frac{1}{\min_{i \in I_0}|w_i|}$ and the bound becomes
$$\mathcal{S}(w, \theta_k) \leq \min\Big\{ \tfrac54 k + \tfrac{1}{\min_{i\in I_0} w_i^2} \log\tfrac{p}{k} \,,\; \tfrac53 \big(\min_{i \in I_0}|w_i|\big)^{-2/3} (p-k)^{1/3} p^{1/6} \Big\}.$$

2. In the most favorable case, when all the $s = k$ nonzero elements take the same absolute value, i.e. $w \in \mathcal{A}^0_{k,1}$, then $r = k-1$ takes its maximal value. In this case the norm is proportional to the $\ell_1$ norm and we bound the statistical dimension by $\tfrac54 k + k \log(p/k)$, which coincides with the bound for the statistical dimension of the $\ell_1$ norm at a k-sparse point. Note that the bound is continuous at atoms: for $w \in \mathcal{A}_{k,1} \setminus \mathcal{A}^0_{k,1}$ we had a dispersion term $\tfrac{1}{\min_i |w_i|}$, which tends to $\sqrt{k}$ as $w$ gets close to $\mathcal{A}^0_{k,1}$, and this fits $\tfrac54 k + k\log(p/k)$. Substituting $m_2 = q = 1$ in Proposition 6 on the (k, q)-CUT polytope, we get a term of the same order of magnitude with a larger constant, which we believe is due to the proof techniques used in [6], Theorem 3.9 and Corollary 3.4. The order of magnitude however cannot be improved, and this ensures the tightness of the bound up to a multiplicative factor. In Figure 1 one can distinguish the points whose coordinates have the same absolute value at the intersection of the three unit ball edges, represented in green, red and blue. The tightness between the $\ell_1$ and k-support norm tangent cones on $\mathcal{A}^0_{k,1}$ can be clearly seen in this figure.

3. An intermediate regime exists where $\theta_k(w) / \sum_{i=k-r}^{s}|w|_{(i)}$ is of the order of $\sqrt{p}$. In this case we get an asymptotic behavior in $O\big( (\theta_k(w)/\sum_{i=k-r}^{s}|w|_{(i)})^{2/3} \sqrt{p} \big)$.

4. In the other extreme and least favorable case, when $r = 0$ and $\theta_k(w) / \sum_{i=k-r}^{s}|w|_{(i)}$ is much larger than $\sqrt{p}$, the k-support norm is proportional to the $\ell_2$ norm in a neighborhood of the point. This situation happens at points close to canonical elements, $e_1$ for instance, where the norm is differentiable. In these cases the Gaussian width is trivially bounded by $p$.

Table 2: Orders of magnitude of the statistical dimension of various norms at atoms of the subset $\mathcal{A}^0_{k,q}$. The $\ell_1$ norm for matrices is the element-wise $\ell_1$ norm, and the trace norm in the vector case ($p = m_1$, $m_2 = 1$) reduces to the $\ell_2$ norm. The column $k = \sqrt{m}$ gives the orders of magnitude in the planted clique setting, where $m_1 = m_2 = m$ and the clique size is $k = q = \sqrt{m}$.

Matrix norm | Statistical dimension | $k = \sqrt{m}$ | Vector norm | Statistical dimension
(k, q)-trace | $k \log\frac{m_1}{k} + q \log\frac{m_2}{q}$ | $\sqrt{m} \log m$ | k-support | $k \log\frac{p}{k}$
(k, q)-CUT | $k \log\frac{m_1}{k} + q \log\frac{m_2}{q}$ | $\sqrt{m} \log m$ | $\kappa_k$ | $k \log\frac{p}{k}$
$\ell_1$ | $kq \log\frac{m_1 m_2}{kq}$ | $m \log m$ | $\ell_1$ | $k \log\frac{p}{k}$
trace | $m_1 + m_2$ | $m$ | $\ell_2$ | $p$
$\ell_1$ + trace | $kq \wedge (m_1 + m_2)$ | $m$ | elastic net | $k \log\frac{p}{k}$
CUT | $m_1 + m_2$ | $m$ | $\ell_\infty$ | $p$

We now study the polyhedral functions defined using the gauge of $\mathcal{A}^0_{k,q}$, which we named the (k, q)-CUT norm.

Proposition 6 The statistical dimension of the (k, q)-CUT norm at elements of $\mathcal{A}^0_{k,q}$ is bounded by
$$9\Big( k \log\frac{m_1}{k} + q \log\frac{m_2}{q} \Big) + (k + q)(1 + \log 2).$$

An elementary corollary of Proposition 6 provides the statistical dimension of $\kappa_k$ by setting $q = 1$: for all $a \in \mathcal{A}^0_{k,1}$,
$$\mathcal{S}(a, \kappa_k) \;\leq\; 9 k \log\frac{p}{k} + 9(1 + \log 2)\, k.$$

In order to compare this polytope with the CUT polytope (see e.g. [9]), we note that any matrix $A = ab^\top \in \mathcal{A}^0_{k,q}$ lies on a face of the polytope $\tfrac{1}{\sqrt{kq}}\,\mathrm{CUT}$, where $\mathrm{CUT} = \mathrm{conv}\{ab^\top : a \in \{\pm 1\}^{m_1}, b \in \{\pm 1\}^{m_2}\}$. We can state the following result on the statistical dimension of this norm, obtained by using the same argument as above and by noticing that CUT has $2^{m_1 + m_2}$ vertices.

Proposition 7 The statistical dimension of the CUT norm on $\mathcal{A}^0_{k,q}$ is bounded by $O(m_1 + m_2)$.
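For a quick numerical feel for the planted-clique column of Table 2, the snippet below (ours) evaluates the tabulated orders of magnitude in the regime $m_1 = m_2 = m$, $k = q = \sqrt{m}$, for a couple of values of $m$; constants are ignored, as in the table.

```python
import numpy as np

def rates(m):
    """Orders of magnitude from Table 2 in the regime m1 = m2 = m, k = q = sqrt(m)."""
    k = np.sqrt(m)
    return {
        "(k,q)-trace / (k,q)-CUT": 2 * k * np.log(m / k),        # k log(m/k) + q log(m/q)
        "l1 (element-wise)":       k * k * np.log(m * m / (k * k)),
        "trace":                   2 * m,                         # m1 + m2
        "l1 + trace":              min(k * k, 2 * m),             # kq ^ (m1 + m2)
    }

for m in (10**4, 10**6):
    print(m, {name: round(val) for name, val in rates(m).items()})
```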

3.4 Comparison of the norms' properties

The study of the vector case completes the comparison with a surprising message: the construction of the (k, q)-trace norm leads to a significant improvement over its simpler rivals (the $\ell_1$ and trace norms), but in the vector case the k-support norm does not lead to a significant benefit. The same holds for the polyhedral norms defined through the (k, q)-CUT and $\kappa_k$ norms. In fact, the bound obtained in Proposition 5 matches the $\ell_1$-norm order of magnitude $k \log\frac{p}{k}$. This was in fact expected (and is also tight), as even the norm induced by the gauge of the convex hull of $\mathcal{A}^0_{k,1}$ in the vector case, namely the (k, 1)-CUT norm, leads to the same order of magnitude $k \log\frac{p}{k}$. In other words, the analogous tight relaxation in the vector case does not lead to an improvement in the statistical dimension rates. We point out that taking $k = \sqrt{m}$, as in the planted clique setting, we obtain a statistical dimension of $\sqrt{m} \log m$, which is small compared to the $O(m)$ rate obtained using the CUT polytope or the trace norm, or to the $m \log m$ rate obtained using lifted trace norms [16]. Table 2 gathers the statistical dimensions of the different norms in the most favorable case of $\mathcal{A}^0_{k,q}$ atoms. A representation of the edges of the atom sets in the vector case can be found in Figure 1. In this figure, the elements of $\mathcal{A}^0_{k,1}$ lie at the intersection of the three represented edges, and one clearly sees that the tangent cones coincide at these points, which gives an intuition for the identical orders of magnitude found for the statistical dimensions. A last interesting remark is that the vector norms have closed-form expressions in terms of the vector entries, whereas the matrix norms, except the element-wise $\ell_1$ norm, only have variational expressions. This is due to a basic fact: ordering the elements of a vector is possible, but not those of a matrix. Therefore, among the matrix norms, computing the dual norm in closed form is feasible only for the element-wise $\ell_1$ norm.

References

[1] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: a geometric theory of phase transitions in convex optimization. Submitted, 2013.
[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. NIPS, 2012.
[3] B. N. Bhaskar, G. Tang, and B. Recht. Atomic norm denoising with applications to line spectral estimation. Preprint, 2012.
[4] E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 2013.
[5] V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 2013.
[6] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 2012.
[7] A. d'Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9, 2008.
[8] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 2007.
[9] M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer, 1996.
[10] R. Foygel and L. Mackey. Corrupted sensing: novel guarantees for separating structured signals. Preprint, 2013.
[11] G. J. O. Jameson. Summing and Nuclear Norms in Banach Space Theory. Cambridge University Press, 1987.
[12] L. Mackey. Deflation methods for sparse PCA. NIPS, 2008.
[13] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 2010.
[14] S. Oymak and B. Hassibi. On a relation between the minimax denoising and the phase transitions of convex functions. In preparation, 2012.
[15] S. Oymak, A. Jalali, M. Fazel, Y. Eldar, and B. Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. Submitted, 2012.
[16] E. Richard, F. Bach, and J.-P. Vert. Intersecting singularities for multi-structured estimation. ICML, 2013.
[17] E. Richard, S. Gaiffas, and N. Vayatis. Link prediction in graphs with autoregressive features. NIPS, 2012.
[18] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[19] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[20] V. Vu, J. Cho, J. Lei, and K. Rohe. Fantope projection and selection: a near-optimal convex relaxation of sparse PCA. NIPS, 2013.
[21] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 2009.
[22] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 2006.


Symmetric Factorization for Nonconvex Optimization Symmetric Factorization for Nonconvex Optimization Qinqing Zheng February 24, 2017 1 Overview A growing body of recent research is shedding new light on the role of nonconvex optimization for tackling

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

IEOR 265 Lecture 3 Sparse Linear Regression

IEOR 265 Lecture 3 Sparse Linear Regression IOR 65 Lecture 3 Sparse Linear Regression 1 M Bound Recall from last lecture that the reason we are interested in complexity measures of sets is because of the following result, which is known as the M

More information

Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit

Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit Arvind Ganesh, John Wright, Xiaodong Li, Emmanuel J. Candès, and Yi Ma, Microsoft Research Asia, Beijing, P.R.C Dept. of Electrical

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions Yin Zhang Technical Report TR05-06 Department of Computational and Applied Mathematics Rice University,

More information

Sparse Principal Component Analysis Formulations And Algorithms

Sparse Principal Component Analysis Formulations And Algorithms Sparse Principal Component Analysis Formulations And Algorithms SLIDE 1 Outline 1 Background What Is Principal Component Analysis (PCA)? What Is Sparse Principal Component Analysis (spca)? 2 The Sparse

More information

Exact Low-rank Matrix Recovery via Nonconvex M p -Minimization

Exact Low-rank Matrix Recovery via Nonconvex M p -Minimization Exact Low-rank Matrix Recovery via Nonconvex M p -Minimization Lingchen Kong and Naihua Xiu Department of Applied Mathematics, Beijing Jiaotong University, Beijing, 100044, People s Republic of China E-mail:

More information

On Optimal Frame Conditioners

On Optimal Frame Conditioners On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

Compressed Sensing and Robust Recovery of Low Rank Matrices

Compressed Sensing and Robust Recovery of Low Rank Matrices Compressed Sensing and Robust Recovery of Low Rank Matrices M. Fazel, E. Candès, B. Recht, P. Parrilo Electrical Engineering, University of Washington Applied and Computational Mathematics Dept., Caltech

More information

A tutorial on sparse modeling. Outline:

A tutorial on sparse modeling. Outline: A tutorial on sparse modeling. Outline: 1. Why? 2. What? 3. How. 4. no really, why? Sparse modeling is a component in many state of the art signal processing and machine learning tasks. image processing

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Lecture 26: April 22nd

Lecture 26: April 22nd 10-725/36-725: Conve Optimization Spring 2015 Lecture 26: April 22nd Lecturer: Ryan Tibshirani Scribes: Eric Wong, Jerzy Wieczorek, Pengcheng Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept.

More information

Semidefinite Programming

Semidefinite Programming Semidefinite Programming Notes by Bernd Sturmfels for the lecture on June 26, 208, in the IMPRS Ringvorlesung Introduction to Nonlinear Algebra The transition from linear algebra to nonlinear algebra has

More information

Self-Calibration and Biconvex Compressive Sensing

Self-Calibration and Biconvex Compressive Sensing Self-Calibration and Biconvex Compressive Sensing Shuyang Ling Department of Mathematics, UC Davis July 12, 2017 Shuyang Ling (UC Davis) SIAM Annual Meeting, 2017, Pittsburgh July 12, 2017 1 / 22 Acknowledgements

More information

Lecture Note 5: Semidefinite Programming for Stability Analysis

Lecture Note 5: Semidefinite Programming for Stability Analysis ECE7850: Hybrid Systems:Theory and Applications Lecture Note 5: Semidefinite Programming for Stability Analysis Wei Zhang Assistant Professor Department of Electrical and Computer Engineering Ohio State

More information

arxiv:submit/ [cs.it] 16 Dec 2012

arxiv:submit/ [cs.it] 16 Dec 2012 Simultaneously Structured Models with Application to Sparse and Low-rank Matrices arxiv:submit/0614913 [cs.it] 16 Dec 01 Samet Oymak c, Amin Jalali w, Maryam Fazel w, Yonina C. Eldar t, Babak Hassibi c,

More information

Information-Theoretic Limits of Matrix Completion

Information-Theoretic Limits of Matrix Completion Information-Theoretic Limits of Matrix Completion Erwin Riegler, David Stotz, and Helmut Bölcskei Dept. IT & EE, ETH Zurich, Switzerland Email: {eriegler, dstotz, boelcskei}@nari.ee.ethz.ch Abstract We

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

arxiv: v3 [stat.me] 8 Jun 2018

arxiv: v3 [stat.me] 8 Jun 2018 Between hard and soft thresholding: optimal iterative thresholding algorithms Haoyang Liu and Rina Foygel Barber arxiv:804.0884v3 [stat.me] 8 Jun 08 June, 08 Abstract Iterative thresholding algorithms

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

The Stability of Low-Rank Matrix Reconstruction: a Constrained Singular Value Perspective

The Stability of Low-Rank Matrix Reconstruction: a Constrained Singular Value Perspective Forty-Eighth Annual Allerton Conference Allerton House UIUC Illinois USA September 9 - October 1 010 The Stability of Low-Rank Matrix Reconstruction: a Constrained Singular Value Perspective Gongguo Tang

More information

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming High Dimensional Inverse Covariate Matrix Estimation via Linear Programming Ming Yuan October 24, 2011 Gaussian Graphical Model X = (X 1,..., X p ) indep. N(µ, Σ) Inverse covariance matrix Σ 1 = Ω = (ω

More information

ESTIMATION OF (NEAR) LOW-RANK MATRICES WITH NOISE AND HIGH-DIMENSIONAL SCALING

ESTIMATION OF (NEAR) LOW-RANK MATRICES WITH NOISE AND HIGH-DIMENSIONAL SCALING Submitted to the Annals of Statistics ESTIMATION OF (NEAR) LOW-RANK MATRICES WITH NOISE AND HIGH-DIMENSIONAL SCALING By Sahand Negahban and Martin J. Wainwright University of California, Berkeley We study

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Covariance Sketching via Quadratic Sampling

Covariance Sketching via Quadratic Sampling Covariance Sketching via Quadratic Sampling Yuejie Chi Department of ECE and BMI The Ohio State University Tsinghua University June 2015 Page 1 Acknowledgement Thanks to my academic collaborators on some

More information