(k, q)-trace norm for sparse matrix factorization
Emile Richard, Department of Electrical Engineering, Stanford University
Guillaume Obozinski, Imagine, École des Ponts ParisTech
Jean-Philippe Vert, CBIO, Mines ParisTech; Institut Curie; INSERM

Abstract

We propose a new convex formulation for the sparse matrix factorization problem, in which the number of nonzero elements of the factors is fixed. An example is the problem of sparse principal component analysis, which we reformulate as a proximal operator computation through the definition of new matrix norms that can be used more generally as regularizers in various optimization problems over matrices. We estimate the Gaussian complexity of the suggested norms and of their vector analogues. Surprisingly, in the matrix case, knowing the block sizes in advance improves the statistical power of the norms by an order of magnitude compared to the usual $\ell_1$ and trace norms, whereas in the vector case the norms built using knowledge of the number of nonzeros have the same complexity as the $\ell_1$ norm.

1 Introduction

A range of supervised problems, such as link prediction in graphs containing community structure [7], phase retrieval [4] or dictionary learning [3], amount to solving sparse factorization problems, i.e., to inferring a low-rank matrix that can be factorized as the product of two sparse matrices with few columns (left factor) and few rows (right factor). Such a factorization allows more efficient storage, faster computation and more interpretable solutions, and above all leads to more accurate estimates in many situations. In the case of interaction networks, for example, this is related to the assumption that the network is organized as a collection of highly connected communities which can overlap.
More generally, considering sparse low-rank matrices combines two natural forms of sparsity, in the spectrum and in the support, which can be motivated by the need to explain the behavior of systems by a superposition of latent processes that each involve only a few parameters. Landmark applications of sparse matrix factorization are sparse principal component analysis [8, 12] or sparse canonical correlation analysis, which are widely used to analyze high-dimensional data such as genomic data.
We are motivated by the development of convex approaches to estimate sparse low-rank matrices. In order to properly define matrix factorization, given sparsity levels of the factors denoted by integers $k$ and $q$, we first introduce below the $(k, q)$-rank of a matrix as the minimum number of left and right factors, having respectively $k$ and $q$ nonzeros, required to reconstruct the matrix. This index is a more involved complexity measure for matrices than the rank, in that it conditions on the number of nonzero elements of the left and right factors of a matrix. Using this index, we propose two new atomic norms for matrices [6], obtained by restricting the atoms to the unit ball of the operator norm or of the element-wise $\ell_\infty$ norm, resulting in convex surrogates for the low $(k, q)$-rank matrix estimation problem. We provide an equivalent formulation of the norms as nuclear norms in the sense of [11], highlighting in particular a link to the k-support norm of [2]. We analyze the Gaussian complexity of the new norms and compare them to other classical penalties. Our analysis shows that while in the vector case the simple $\ell_1$ norm provides similar statistical power to the relaxation obtained by knowing in advance the number of nonzeros of the unknown, in the matrix case one can gain an order of magnitude by considering the more involved relaxations. Nonetheless, while in the vector case the computation remains feasible in polynomial time, the norms we introduce for matrices cannot be evaluated in polynomial time. We propose algorithmic schemes to approximately learn with the new norms. The same norms and meta-algorithms can be used as regularizers in supervised problems such as multitask learning, quadratic regression and phase retrieval, highlighting the fact that our algorithmic contribution does not consist in providing more efficient solutions to the rank-1 SPCA problem, but in combining atoms found by rank-1 solvers in a principled way.
We finally show experiments on real data using bilinear regression, demonstrating the flexibility of our norms in formulating various matrix-valued problems.

1.1 Notations

Throughout the text, $m_1$, $m_2$, $k$ and $q$ are integers representing respectively the size $m_1 \times m_2$ of the matrices we consider, and the number of nonzero components in the left ($k$) and right ($q$) factors when we have a low-rank matrix. Given matrices $A$ and $B$ of the same size, $\langle A, B \rangle = \mathrm{tr}(A^\top B)$ is the standard inner product of matrices. For any matrix $Z \in \mathbb{R}^{m_1 \times m_2}$, the notations $\|Z\|_0$, $\|Z\|_1$, $\|Z\|_\infty$, $\|Z\|_*$, $\|Z\|_{op}$ and $\mathrm{rank}(Z)$ stand respectively for the number of nonzeros, the entry-wise $\ell_1$ and $\ell_\infty$ norms, the trace norm (or nuclear norm, the sum of the singular values), the operator norm (the largest singular value) and the rank of $Z$. $\mathrm{supp}(Z) \subset \{1, \dots, m_1\} \times \{1, \dots, m_2\}$ is the set of indices of nonzero elements of $Z$. When dealing with a matrix $Z$ whose nonzero elements form a block, $\mathrm{supp}(Z)$ takes the form $I \times J$ where $I \subset \{1, \dots, m_1\}$ and $J \subset \{1, \dots, m_2\}$. $\mathcal{G}_k$ (resp. $\mathcal{G}_q$) is the set of subsets of $k$ indices in $\{1, \dots, m_1\}$ (resp. $q$ indices in $\{1, \dots, m_2\}$). For a matrix $Z$ and two subsets of indices $I \subset \{1, \dots, m_1\}$ and $J \subset \{1, \dots, m_2\}$, $Z_{I,J}$ is the matrix having the same entries as $Z$ inside the index subset $I \times J$, and zero entries outside.

2 Tight convex relaxations of sparse factorization constraints

In this section we propose two new matrix norms helpful to define convex formulations of various sparse matrix factorization problems. We start by defining the $(k, q)$-rank of a matrix in Section 2.1, a useful generalization of the rank which also quantifies the sparseness of a matrix factorization. We then introduce two atomic norms defined as tight convex relaxations of the $(k, q)$-rank: the $(k, q)$-trace norm (Section 2.2), obtained by relaxing the $(k, q)$-rank over the operator norm ball, and the $(k, q)$-CUT norm (also Section 2.2), obtained by relaxing the $(k, q)$-rank over the element-wise $\ell_\infty$ ball.
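As a quick concrete illustration of these notations (our own example, not from the paper), the matrix quantities above can be computed with NumPy:

```python
import numpy as np

# A small m1 x m2 matrix whose singular values are easy to read off.
Z = np.array([[3.0, 0.0, 0.0],
              [0.0, -4.0, 0.0]])

nnz = np.count_nonzero(Z)            # ||Z||_0, number of nonzeros
l1 = np.abs(Z).sum()                 # entry-wise l1 norm
linf = np.abs(Z).max()               # entry-wise l-infinity norm
sv = np.linalg.svd(Z, compute_uv=False)
trace_norm = sv.sum()                # trace norm: sum of singular values
op_norm = sv.max()                   # operator norm: largest singular value
rk = np.linalg.matrix_rank(Z)

print(nnz, l1, linf, trace_norm, op_norm, rk)  # 2 7.0 4.0 7.0 4.0 2
```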
We finally relate these matrix norms to vector norms using the concept of nuclear norms due to [11], linking in particular the $(k, q)$-trace norm for matrices to the k-support norm of [2].
2.1 The (k, q)-rank of a matrix

The rank of a matrix $Z$ is the minimum number of rank-1 matrices (i.e., outer products of vectors, of the form $ab^\top$) needed to express $Z$ as a linear combination $Z = \sum_{i=1}^r a_i b_i^\top$. It is a versatile concept, central to the definition of matrix factorization problems and low-rank approximations. The following definition generalizes this notion to incorporate conditions on the sparseness of the rank-1 elements:

Definition 1 ((k, q)-SVD and (k, q)-rank) For a matrix $Z \in \mathbb{R}^{m_1 \times m_2}$, we call a decomposition of the form $Z = \sum_{i=1}^r c_i a_i b_i^\top$, with $c_1 \ge c_2 \ge \dots \ge c_r > 0$, $a_i b_i^\top \in \mathcal{A}_{k,q}$ and minimal $r$, a $(k, q)$-sparse singular value decomposition of $Z$, or $(k, q)$-SVD. We refer to the vectors $a_i$ and $b_i$ as the left and right sparse singular vectors of $Z$, to the $c_i$ as its $(k, q)$-sparse singular values, and to $r$ as its $(k, q)$-rank.

The $(k, q)$-SVD is not necessarily unique and its factors are in general not orthogonal (see the appendix for some counter-examples). The non-orthogonality of the factors is a major difference with the usual SVD, with important algorithmic consequences. It suggests that enforcing orthogonality between the sparse factors, as done in previous work [12], would not be a good idea in our context. In particular, the example we study in the appendix does not have any sparse SVD with orthogonal factors. Note that by definition $\mathrm{rank}(Z) = (m_1, m_2)$-rank$(Z)$ and $\|Z\|_0 = (1, 1)$-rank$(Z)$, so that we have the tight inequalities
$$\mathrm{rank}(Z) \;\le\; (k, q)\text{-rank}(Z) \;\le\; \|Z\|_0.$$
The $(k, q)$-rank is useful to formalize problems such as sparse matrix factorization, which can be defined as approximating the solution of a matrix-valued problem by a matrix having low $(k, q)$-rank. For instance, the standard rank-1 SPCA problem consists in finding the matrix of $(k, k)$-rank 1 best approximating the sample covariance matrix [8]. Unfortunately, the $(k, q)$-rank is a nonconvex index, like the rank or the cardinality, leading to computational difficulties when one wants to estimate matrices with small $(k, q)$-rank.
In the sequel, we therefore define and study convex relaxations of the $(k, q)$-rank that can replace it to quantify the sparseness and low-rankness of matrices.

2.2 Convex relaxations of the (k, q)-rank

In order to define convex relaxations of the $(k, q)$-rank, it is useful to recall the definition of atomic norms as introduced by [6]:

Definition 2 (atomic norm) Given a centrally symmetric compact subset $\mathcal{A} \subset \mathbb{R}^p$ of elements called atoms, the atomic norm induced by $\mathcal{A}$ on $\mathbb{R}^p$ is the gauge function [19] of $\mathcal{A}$:
$$\|x\|_{\mathcal{A}} = \inf \{ t > 0 : x \in t \,\mathrm{conv}(\mathcal{A}) \}. \quad (1)$$
[6] show that the atomic norm is indeed a norm, which can be rewritten as
$$\|x\|_{\mathcal{A}} = \inf \Big\{ \sum_{a \in \mathcal{A}} c_a \; : \; x = \sum_{a \in \mathcal{A}} c_a a, \;\; c_a \ge 0 \;\; \forall a \in \mathcal{A} \Big\}, \quad (2)$$
and whose dual norm satisfies
$$\|x\|^*_{\mathcal{A}} = \sup \{ \langle x, z \rangle : \|z\|_{\mathcal{A}} \le 1 \} = \sup \{ \langle x, a \rangle : a \in \mathcal{A} \}. \quad (3)$$
We can now propose our first convex relaxation of the $(k, q)$-rank as an atomic norm induced by a particular set of atoms:
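For a finite atom set, the infimum in (2) is a linear program, so the gauge can be evaluated exactly in small dimensions. A minimal sketch using SciPy; the atom set and test vector below are illustrative choices of ours, not the paper's:

```python
import numpy as np
from scipy.optimize import linprog

def atomic_norm(x, atoms):
    # Gauge of conv(atoms) at x via the LP of Eq. (2):
    # minimize sum_a c_a  subject to  x = sum_a c_a * a, c_a >= 0.
    A = np.column_stack(atoms)
    res = linprog(c=np.ones(A.shape[1]), A_eq=A, b_eq=x,
                  bounds=[(0, None)] * A.shape[1])
    return res.fun

# The atoms {+-e_i} induce the l1 norm; sanity check in R^2.
e = np.eye(2)
atoms = [e[0], -e[0], e[1], -e[1]]
print(atomic_norm(np.array([3.0, -4.0]), atoms))  # 7.0
```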
Definition 3 ((k, q)-trace norm) The $(k, q)$-trace norm $\Omega_{k,q}(Z)$ is the atomic norm induced by the following set of atoms:
$$\mathcal{A}_{k,q} = \{ ab^\top : a \in \mathbb{R}^{m_1}, b \in \mathbb{R}^{m_2}, \|a\|_0 \le k, \|b\|_0 \le q, \|a\|_2 = \|b\|_2 = 1 \}. \quad (4)$$
In other words, $\mathcal{A}_{k,q}$ is the set of matrices $Z \in \mathbb{R}^{m_1 \times m_2}$ such that $(k, q)$-rank$(Z) = 1$ and $\|Z\|_{op} = 1$; the $(k, q)$-trace norm is therefore the convex relaxation of the $(k, q)$-rank over the operator norm unit ball. Besides the characterization of $\Omega_{k,q}(Z)$ as a gauge function (1), the following lemma (whose proof is postponed to the appendix) provides an alternative characterization of $\Omega_{k,q}$ and its dual:

Lemma 1 For any $Z \in \mathbb{R}^{m_1 \times m_2}$ we have
$$\Omega_{k,q}(Z) = \inf \Big\{ \sum_{(I,J) \in \mathcal{G}_k \times \mathcal{G}_q} \|Z^{(I,J)}\|_* \; : \; Z = \sum_{(I,J)} Z^{(I,J)}, \; \mathrm{supp}(Z^{(I,J)}) \subset I \times J \Big\}, \quad (5)$$
and
$$\Omega^*_{k,q}(Z) = \max \{ \|Z_{I,J}\|_{op} : I \in \mathcal{G}_k, J \in \mathcal{G}_q \}. \quad (6)$$
Our second norm is obtained by focusing on a more restricted set of atoms, motivated by applications where, in addition to being sparse and low-rank, we want to estimate matrices which are constant over blocks:

Definition 4 ((k, q)-CUT norm) The $(k, q)$-CUT norm $\|Z\|_{(k,q)\text{-CUT}}$ is the atomic norm induced by the following set of atoms:
$$\mathcal{A}^0_{k,q} = \Big\{ ab^\top \; : \; a \in \mathbb{R}^{m_1},\ \|a\|_0 = k,\ \forall i \in \mathrm{supp}(a),\ |a_i| = \tfrac{1}{\sqrt{k}}; \;\; b \in \mathbb{R}^{m_2},\ \|b\|_0 = q,\ \forall j \in \mathrm{supp}(b),\ |b_j| = \tfrac{1}{\sqrt{q}} \Big\}. \quad (7)$$
In other words, the atoms in $\mathcal{A}^0_{k,q}$ are the atoms of $\mathcal{A}_{k,q}$ whose nonzero elements all have the same amplitude; the $(k, q)$-CUT norm is therefore the convex relaxation of the $(k, q)$-rank over the (scaled) element-wise $\ell_\infty$ ball. The name was chosen by analogy with the CUT polytope of [9].

2.3 Equivalent nuclear norms built upon vector norms

Let us now consider a different approach to define convex matrix norms, by building nuclear norms for matrices [11]. These constructions can be seen as a key to writing alternate minimization algorithms.
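The dual norm (6) can be evaluated by brute force on small matrices by enumerating all supports, which makes the combinatorial nature of $\Omega^*_{k,q}$ explicit (the enumeration is exponential in general, which is one reason the paper resorts to approximate schemes). A small sketch of ours:

```python
import itertools
import numpy as np

def dual_kq_trace_norm(Z, k, q):
    # Omega*_{k,q}(Z) from Eq. (6): the largest operator norm of a k x q
    # submatrix, by exhaustive enumeration over supports I, J.
    m1, m2 = Z.shape
    best = 0.0
    for I in itertools.combinations(range(m1), k):
        for J in itertools.combinations(range(m2), q):
            top_sv = np.linalg.svd(Z[np.ix_(I, J)], compute_uv=False)[0]
            best = max(best, top_sv)
    return best

Z = np.diag([3.0, 2.0, 1.0])
print(dual_kq_trace_norm(Z, 1, 1))  # 3.0 : reduces to the largest |Z_ij|
print(dual_kq_trace_norm(Z, 3, 3))  # 3.0 : reduces to the usual operator norm
```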
For that purpose it is useful to recall the following general definition of nuclear norms and the characterization of the corresponding dual norms, taken from [11]:

Proposition 1 (nuclear norm) Let $\alpha$ and $\beta$ denote any vector norms on $\mathbb{R}^{m_1}$ and $\mathbb{R}^{m_2}$, respectively. Then
$$\nu(Z) = \inf \Big\{ \sum_i \alpha(a_i)\, \beta(b_i) \; : \; Z = \sum_i a_i b_i^\top \Big\},$$
where the infimum is taken over all summations of finite length, is a norm over $\mathbb{R}^{m_1 \times m_2}$, called the nuclear norm induced by $\alpha$ and $\beta$. Its dual is given by
$$\nu^*(Z) = \sup \{ a^\top Z b \; : \; \alpha(a) \le 1, \; \beta(b) \le 1 \}. \quad (8)$$
[Figure 1: Unit ball edges of the three vector norms of interest ($\ell_1$, the k-support norm $\theta_k$, and $\kappa_k$) in dimension $p = 3$, with the number of nonzero elements fixed to $k = 2$. The vertices of the $(2,1)$-CUT polytope constitute the set $\mathcal{A}^0_{2,1}$.]

The following result shows that the nuclear norm induced by two atomic norms (Definition 2) is itself an atomic norm. The proof is straightforward: it results from writing the dual norm thanks to (8) and observing that it is equal to the dual of the corresponding atomic norm, using expression (3).

Lemma 2 If $\alpha$ and $\beta$ are two atomic norms on $\mathbb{R}^{m_1}$ and $\mathbb{R}^{m_2}$, induced respectively by two atom sets $\mathcal{A}_1$ and $\mathcal{A}_2$, then the nuclear norm on $\mathbb{R}^{m_1 \times m_2}$ induced by $\alpha$ and $\beta$ is the atomic norm induced by the atom set
$$\mathcal{A} = \{ ab^\top : a \in \mathcal{A}_1, \; b \in \mathcal{A}_2 \}.$$
For instance, if $\alpha$ and $\beta$ are taken to be the $\ell_1$ norm for vectors, which is the atomic norm induced by the unit-norm one-sparse vectors $\{\pm e_i\}$, the resulting nuclear norm is simply the element-wise $\ell_1$ norm of the matrix, induced by the atoms $\{\pm e_i e_j^\top\}$. We now show that the $(k, q)$-trace norm and the $(k, q)$-CUT norm are both nuclear norms, induced by specific vector norms:

Theorem 1 (1) The $(k, q)$-trace norm is the nuclear norm induced by $\theta_k$ and $\theta_q$, the k- and q-support norms of [2], given by
$$\theta_k(w) = \left( \sum_{i=1}^{k-r-1} |w|_{(i)}^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{p} |w|_{(i)} \Big)^2 \right)^{1/2}, \quad (9)$$
where $|w|_{(i)}$ denotes the $i$-th largest entry of $w$ in absolute value, $r \in \{0, \dots, k-1\}$ is the unique integer such that $|w|_{(k-r-1)} > \frac{1}{r+1} \sum_{i=k-r}^{p} |w|_{(i)} \ge |w|_{(k-r)}$, and where by convention $|w|_{(0)} = \infty$.
(2) The $(k, q)$-CUT norm is the nuclear norm induced by the vector norms $\kappa_k(\cdot)$ and $\kappa_q(\cdot)$ defined by
$$\kappa_k(w) = \max \Big( \sqrt{k}\, \|w\|_\infty, \; \tfrac{1}{\sqrt{k}}\, \|w\|_1 \Big). \quad (10)$$
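Both vector norms are easy to evaluate numerically. The sketch below implements the closed form (9) by searching for the integer $r$, and $\kappa_k$ as reconstructed in (10); the helper names and test vectors are our own:

```python
import numpy as np

def theta_k(w, k):
    # k-support norm via the closed form (9): search for the unique r in
    # {0,...,k-1} with |w|_(k-r-1) > (1/(r+1)) sum_{i>=k-r} |w|_(i) >= |w|_(k-r),
    # using the convention |w|_(0) = infinity.
    s = np.sort(np.abs(w))[::-1]          # s[i-1] = |w|_(i)
    for r in range(k):
        tail = s[k - r - 1:].sum()        # sum_{i=k-r}^{p} |w|_(i)
        avg = tail / (r + 1)
        head = np.inf if k - r - 1 == 0 else s[k - r - 2]
        if head > avg >= s[k - r - 1]:
            return float(np.sqrt((s[:k - r - 1] ** 2).sum() + tail ** 2 / (r + 1)))
    raise ValueError("no valid r found")

def kappa_k(w, k):
    # kappa_k from (10): kappa_k(w) = max(sqrt(k)*||w||_inf, ||w||_1/sqrt(k)).
    return max(np.sqrt(k) * np.abs(w).max(), np.abs(w).sum() / np.sqrt(k))

# theta_1 is the l1 norm; for a vector with at most k nonzeros, theta_k is l2.
print(np.isclose(theta_k(np.array([1.5, -2.0, 0.5]), 1), 4.0))               # True
print(np.isclose(theta_k(np.array([3.0, 2.0, 1.0, 0.0]), 3), np.sqrt(14.0)))  # True
```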
2.4 Subdifferential of the (k, q)-trace norm at atoms

The following lemma provides an explicit description of the subdifferential of $\Omega_{k,q}$ at an atom $A = ab^\top \in \mathcal{A}_{k,q}$, which will be useful to study the statistical properties of this norm as a regularizer in the next section.

Lemma 3 Let $A \in \mathcal{A}_{k,q}$ with $\mathrm{supp}(A) = I_0 \times J_0$. The subdifferential of $\Omega_{k,q}$ at $A$ is
$$\partial \Omega_{k,q}(A) = \big\{ A + Z \; : \; A Z_{I_0,J_0}^\top = 0, \;\; A^\top Z_{I_0,J_0} = 0, \;\; \forall (I,J) \in \mathcal{G}_k \times \mathcal{G}_q, \; \|A_{I,J} + Z_{I,J}\|_{op} \le 1 \big\}. \quad (11)$$
The characterization of the subdifferential of a norm at a point is important to understand how easily the point can be recovered when the norm is used as a penalty. A geometric intuition for this fact can be obtained by looking at the normal cone, the conic hull of the subdifferential: the larger the normal cone at a point, the more tractable the estimation of the point in statistical terms when the norm is used as a convex penalty. Comparing the dimension of the subspace in which $Z$ lives in Lemma 3, namely $m_1 m_2 - k - q + 1$, with the analogous subspace in the case of the trace norm, $\{Z : AZ^\top = 0, A^\top Z = 0\}$, which has dimension $m_1 m_2 - m_1 - m_2 + 1$, and with that for the $\ell_1$ norm (matrices with disjoint support), which has dimension $m_1 m_2 - kq$, one expects this norm to have a much larger subdifferential, resulting in better statistical performance. In Section 3 we bring quantitative support to this observation.

2.5 Positive semidefinite matrices

The case of the decomposition of a symmetric positive semidefinite matrix $\Sigma$ deserves a specific discussion. It might indeed be desirable to obtain a decomposition of the form $\Sigma = \sum_{i=1}^r c_i a_i a_i^\top$, with the $c_i$ possibly constrained in addition to be non-negative. This is easily obtained by replacing the set of atoms by $\mathcal{A}_{k,\succeq} = \{aa^\top : a \in \mathcal{A}_k\}$, or $\mathcal{A}_{k,\mathrm{sym}} = \{\pm aa^\top : a \in \mathcal{A}_k\}$, and by constructing the corresponding atomic norms, which are very similar to the ones we presented.
It should be noted, to be accurate, that the set $\mathcal{A}_{k,\succeq}$ is not centrally symmetric and does not induce a norm but simply a gauge, which does not change the claims of this work: they rely only on the convexity of the gauge. When using the corresponding gauges as regularizers, the gauge of $\mathcal{A}_{k,\succeq}$ has the advantage of guaranteeing that the solution of the problem is p.s.d., which might not be the case for the other atomic norms.

2.6 Using $\Omega_{k,q}$ as a regularizer

We suggest using the $(k, q)$-trace norm as a regularizer in a number of applications. This means that once the matrix estimation problem is formulated as the minimization of a convex loss $\ell : \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}$, we consider solving the penalized problem of minimizing $\ell(Z) + \lambda \Omega_{k,q}(Z)$ for a parameter $\lambda > 0$. The simplest loss to consider is the squared Euclidean distance from an observation $Y$, $\ell(Z) = \frac{1}{2}\|Z - Y\|_F^2$, which can be used for denoising an observation $Y = Z^\star + G$, where $G$ stands for the noise and $Z^\star$ for the matrix we want to estimate. An important instance of such a problem is sparse PCA, where, based on the sample covariance matrix $\hat{\Sigma}_n$, one aims at solving successive rank-1 problems of finding the leading sparse eigenvector,
$$\max_z \; z^\top \hat{\Sigma}_n z \quad \text{s.t.} \quad \|z\|_2 = 1, \; \|z\|_0 \le k, \qquad \text{(rank-1 SPCA)}$$
and switching to the next component using some heuristic [12]. Rather than solving a sequence of such problems, we formulate the problem of estimating $r$ principal components having each $k$ nonzero elements as
$$\min \Big\{ \|\hat{\Sigma}_n - Z\|_F^2 \; : \; (k, k)\text{-rank}(Z) \le r \; \text{ and } \; Z \succeq 0 \Big\}, \quad (12)$$
and relax the nonconvex problem (12) to the following convex optimization problem, which can be interpreted as projecting onto the portion of the unit ball of $\Omega_k$ inside the PSD cone, or as computing the proximal map
of $\Omega_k$:
$$\min \Big\{ \tfrac{1}{2}\|\hat{\Sigma}_n - Z\|_F^2 + \lambda \Omega_k(Z) \; : \; Z \succeq 0 \Big\}.$$
Another example of a loss function comes from linear regression, where an observation $y = \mathcal{X}(Z^\star) + \epsilon$ is available, with $\mathcal{X} : \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ linear and $\epsilon$ modeling the noise process. We may use the least-squares loss $\ell(Z) = \|\mathcal{X}(Z) - y\|_2^2$ to estimate $Z^\star$. Bilinear regression is an example where the linear map $\mathcal{X}$ is given by $\mathcal{X}(Z) = \mathrm{vec}(X_1 Z X_2^\top)$, with $X_1 \in \mathbb{R}^{n \times m_1}$ and $X_2 \in \mathbb{R}^{n \times m_2}$ being design matrices. A particular instance of bilinear regression is quadratic regression, which under the PSD constraint $Z \succeq 0$ is closely related to phase retrieval [4]. In this setup, for a design matrix $X \in \mathbb{R}^{n \times p}$, we define $\mathcal{X}(Z) = \mathrm{diag}(X Z X^\top)$. We expect $Z^\star$ to be sparse, motivated by the small number of active variables in the data, and we also assume that the organization of the feature interactions into an even smaller number of latent groups makes the true parameter $Z^\star$ low-rank.

3 Statistical properties of the suggested regularizers

As discussed above, the norms we introduce are designed to be used as regularizers in various supervised and unsupervised problems. We briefly recall a set of relevant statistical guarantees that can be stated based on computing the expected dual norm of a noise process and the statistical dimension of the regularizer at particular points. In Section 3.1 we study the denoising problem: we compare the minimum mean squared errors obtained with different regularizers when denoising a single spike corrupted with additive Gaussian noise. In Section 3.2 we focus on computing the statistical dimension of the objects of interest, both in the matrix and vector cases, and compare the asymptotic rates, which sheds light on the power of the norms we study when used as convex penalties.

3.1 Minimum square error in denoising

Knowing the expected value of the dual norm of the noise process is informative for denoising. Consider the denoising problem where $Y = Z^\star + G$ is observed and we aim at estimating $Z^\star$, knowing that it has low $(k, q)$-rank.
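Since evaluating $\Omega_{k,q}$, let alone its proximal operator, is intractable, a greedy heuristic in the spirit of the meta-algorithms mentioned above can be sketched as follows: repeatedly extract an approximate $(k, q)$-sparse rank-1 atom from the residual by truncated power iteration, and add it with its coefficient shrunk by $\lambda$. This is a matching-pursuit-style sketch of ours, not the paper's exact algorithm, and all function names are hypothetical:

```python
import numpy as np

def best_atom(R, k, q, iters=50, seed=0):
    # Heuristic oracle for max <R, a b^T> over atoms: truncated power
    # iteration (the exact oracle is rank-1 SPCA and is NP-hard).
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(R.shape[1]); b /= np.linalg.norm(b)
    a = np.zeros(R.shape[0])
    for _ in range(iters):
        a = R @ b
        a[np.argsort(np.abs(a))[:-k]] = 0.0      # keep the k largest entries of a
        na = np.linalg.norm(a)
        if na == 0:
            b = rng.standard_normal(R.shape[1]); b /= np.linalg.norm(b)
            continue
        a /= na
        b = R.T @ a
        b[np.argsort(np.abs(b))[:-q]] = 0.0      # keep the q largest entries of b
        nb = np.linalg.norm(b)
        if nb > 0:
            b /= nb
    return a, b

def approx_prox(Y, lam, k, q, max_atoms=10):
    # Matching-pursuit sketch of the proximal map of lam * Omega_{k,q}:
    # add a greedy atom while the certificate <R, a b^T> exceeds lam,
    # shrinking each coefficient by lam. A heuristic, not an exact prox.
    Z = np.zeros_like(Y)
    for t in range(max_atoms):
        R = Y - Z
        a, b = best_atom(R, k, q, seed=t)
        c = a @ R @ b - lam
        if c <= 1e-12:
            break
        Z += c * np.outer(a, b)
    return Z

# Example: Y = 5 * (a sparse rank-1 atom); with lam = 1 the heuristic
# recovers the atom with its coefficient shrunk to 4.
a0 = np.zeros(4); a0[:2] = 1.0 / np.sqrt(2)
b0 = np.zeros(3); b0[0] = 1.0
Z = approx_prox(5.0 * np.outer(a0, b0), 1.0, k=2, q=1)
print(np.allclose(Z, 4.0 * np.outer(a0, b0)))  # True
```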
Following [3, Theorem 1], we know that when the parameter $\lambda$ is chosen large enough, $\lambda \ge \mathbb{E}\,\Omega^*_{k,q}(G)$, the proximal or soft-thresholding operator associated with $\Omega_{k,q}$, namely
$$\hat{Z} = \arg\min_{Z \in \mathbb{R}^{m_1 \times m_2}} \tfrac{1}{2}\|Z - Y\|_F^2 + \lambda \Omega_{k,q}(Z),$$
satisfies $\mathbb{E}\|\hat{Z} - Z^\star\|_F^2 \le \lambda\, \Omega_{k,q}(Z^\star)$. For instance, if the model is $Y = Z^\star + G$ with $G \in \mathbb{R}^{m_1 \times m_2}$ having iid Gaussian entries drawn from $\mathcal{N}(0, \sigma^2)$, we can compute the expected value of the dual norm, which governs the choice of $\lambda$.

Lemma 4 For $G \in \mathbb{R}^{m_1 \times m_2}$ with entries drawn iid from $\mathcal{N}(0, 1)$, we have
$$\mathbb{E}\,\Omega^*_{k,q}(G) \le \sqrt{k \log \frac{m_1}{k} + \frac{k}{4} + q \log \frac{m_2}{q} + \frac{q}{4}}. \quad (13)$$
More classical results are the similar quantities for the operator and $\ell_\infty$ norms: $\mathbb{E}\,\|G\|_{op} \le \sqrt{m_1} + \sqrt{m_2}$ and $\mathbb{E}\,\|G\|_\infty \le \sqrt{2 \log(m_1 m_2)}$. As a consequence, under centered Gaussian noise with variance $\sigma^2$, we obtain the bounds below.
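The estimator template can be exercised numerically with the entry-wise $\ell_1$ norm, whose proximal operator is plain soft-thresholding (the $\Omega_{k,q}$ prox itself is intractable). The spike, noise level and regularization constant below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, sigma = 40, 4, 0.1
Zstar = np.zeros((m, m)); Zstar[:k, :k] = 1.0 / k     # a k x k spike
lam = sigma * np.sqrt(2 * np.log(m * m))              # lambda ~ sigma * E||G||_inf
trials, err = 50, 0.0
for _ in range(trials):
    Y = Zstar + sigma * rng.standard_normal((m, m))
    Zhat = np.sign(Y) * np.maximum(np.abs(Y) - lam, 0.0)  # prox of lam*||.||_1
    err += np.sum((Zhat - Zstar) ** 2) / trials

# The empirical error stays within the sigma^2 * kq * log(m1 m2) regime.
print(err < sigma ** 2 * k * k * 4 * np.log(m * m))  # True
```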
Table 1: Minimum square error achieved by various norms when denoising an atom $ab^\top \in \mathcal{A}^0_{k,q}$ corrupted with unit-variance Gaussian noise. The column $k = \sqrt{m}$ gives the orders of magnitude in the regime where $m_1 = m_2 = m$ and $k = q = \sqrt{m}$.

| Matrix norm | Minimum square error | Min. S. E. for $k = \sqrt{m}$ |
|---|---|---|
| $(k,q)$-trace | $k \log \frac{m_1}{k} + q \log \frac{m_2}{q}$ | $\sqrt{m} \log m$ |
| trace | $m_1 + m_2$ | $m$ |
| $\ell_1$ | $kq \log(m_1 m_2)$ | $m \log m$ |

Theorem 2 Let $G \in \mathbb{R}^{m_1 \times m_2}$ with entries drawn iid from $\mathcal{N}(0, 1)$ and $Y = Z^\star + \sigma G$ a noisy observation. The estimator $\hat{Z} = \arg\min_Z \frac{1}{2}\|Z - Y\|_F^2 + \lambda \Omega_{k,q}(Z)$, where $\lambda = \sigma \sqrt{k \log \frac{m_1}{k} + \frac{k}{4} + q \log \frac{m_2}{q} + \frac{q}{4}}$, satisfies
$$\mathbb{E}\|\hat{Z} - Z^\star\|_F^2 \;\le\; \sigma\, \Omega_{k,q}(Z^\star) \sqrt{k \log \frac{m_1}{k} + \frac{k}{4} + q \log \frac{m_2}{q} + \frac{q}{4}}.$$
The similar estimators built using the $\ell_1$ and trace norms, namely $\tilde{Z} = \arg\min_Z \frac{1}{2}\|Z - Y\|_F^2 + \lambda_1 \|Z\|_1$ and $\bar{Z} = \arg\min_Z \frac{1}{2}\|Z - Y\|_F^2 + \lambda_* \|Z\|_*$, where $\lambda_1 = \sigma \sqrt{2 \log(m_1 m_2)}$ and $\lambda_* = \sigma(\sqrt{m_1} + \sqrt{m_2})$, satisfy
$$\mathbb{E}\|\tilde{Z} - Z^\star\|_F^2 \le \sigma \|Z^\star\|_1 \sqrt{2 \log(m_1 m_2)} \quad \text{and} \quad \mathbb{E}\|\bar{Z} - Z^\star\|_F^2 \le \sigma \|Z^\star\|_* (\sqrt{m_1} + \sqrt{m_2}).$$
A fundamental example is the so-called single-spike model, where the observation consists of an atom with entries equal in magnitude, $ab^\top \in \mathcal{A}^0_{k,q}$, corrupted with additive noise: $Y = ab^\top + \sigma G$. In this situation one may want to compare denoising with $\Omega_{k,q}$ to denoising with the trace norm or the $\ell_1$ norm. Thanks to $\|ab^\top\|_1 = \sqrt{k}\sqrt{q} = \sqrt{kq}$ and $\Omega_{k,q}(ab^\top) = \|ab^\top\|_* = 1$, we obtain the following.

Corollary 1 When $Z^\star = ab^\top \in \mathcal{A}^0_{k,q}$ is an atom, the estimators defined in Theorem 2 satisfy
$$\mathbb{E}\|\hat{Z} - ab^\top\|_F^2 \le \sigma \sqrt{k \log \frac{m_1}{k} + \frac{k}{4} + q \log \frac{m_2}{q} + \frac{q}{4}},$$
$$\mathbb{E}\|\tilde{Z} - ab^\top\|_F^2 \le \sigma \sqrt{2 kq \log(m_1 m_2)} \quad \text{and} \quad \mathbb{E}\|\bar{Z} - ab^\top\|_F^2 \le \sigma (\sqrt{m_1} + \sqrt{m_2}).$$
This is a first observation exhibiting the power of this regularizer, and the interesting orders of magnitude are put together in Table 1 to ease the comparison. Note that when $ab^\top \in \mathcal{A}_{k,q}$ gets far from the elements of $\mathcal{A}^0_{k,q}$, for instance when getting close to $e_1 e_1^\top$, the bound gets tighter for the $\ell_1$ norm and reaches $\sigma \sqrt{2 \log(m_1 m_2)}$ at $e_1 e_1^\top$, while not changing for the two other norms.

3.2 The statistical dimension of a convex function at a given point

Interesting tools have recently been used by [6, 14, 1, 10] to quantify the statistical power of a convex nonsmooth regularizer used as a constraint or penalty.
The Gaussian complexity term has a nice geometric interpretation: it quantifies the width of the cone of descent directions locally. Interestingly, this quantity is related to sample complexity terms for compressed sensing, signal denoising and demixing applications. For conciseness, we call statistical dimension of a function at a point the quantity that [1] call the statistical dimension of the tangent cone associated with the function at the given point. This quantity equals
(up to an additive constant 1) the squared Gaussian width of the tangent cone intersected with the unit Euclidean sphere, used by [6]; see [1] for technical details. To give a concise definition of what we refer to as the statistical dimension, let, for a convex function $f$, $N_f(w)$ denote the normal cone of $f$ at the point $w$, that is, the conic hull of the subdifferential. Equivalently, one can define the normal cone as the polar cone of the set of descent directions, also called the tangent cone. We denote the statistical dimension of a convex function $f : \mathbb{R}^p \to \mathbb{R}$ at $w \in \mathbb{R}^p$ by
$$S(w, f) = \mathbb{E}\, \mathrm{dist}(g, N_f(w))^2,$$
where $\mathrm{dist}$ denotes the Euclidean distance of the Gaussian vector $g \sim \mathcal{N}(0, I_p)$ from the normal cone $N_f(w)$ of $f$ at $w$. To motivate our particular focus on computing this specific complexity term, we recall a non-exhaustive list of results formulated using the statistical dimension.

Exact recovery. Having observed $y = X w^\star$ with $X \in \mathbb{R}^{n \times p}$ having iid entries drawn from $\mathcal{N}(0, 1/n)$, the solution to
$$\min_w f(w) \quad \text{s.t.} \quad Xw = y$$
equals $w^\star$ with overwhelming probability as soon as $n \ge S(w^\star, f)$; see [6, Corollary 3.3]. In addition, [1] proved (Theorem II) that the phase transition is centered at $S(w^\star, f)$, with a width not exceeding $\sqrt{p}$ in magnitude.

Robust recovery. Given a corrupted observation $y = X w^\star + \epsilon$, where the corruption $\epsilon \in \mathbb{R}^n$ is assumed to be bounded, $\|\epsilon\|_2 \le \delta$, then
$$\hat{w} = \arg\min_w f(w) \quad \text{s.t.} \quad \|Xw - y\|_2 \le \delta$$
satisfies $\|\hat{w} - w^\star\|_2 \le 2\delta/\eta$ with overwhelming probability as soon as $n \ge (S(w^\star, f) + 3/2)/(1 - \eta)^2$; see [6, Corollary 3.3].

Denoising. Assume a collection of noisy observations $x_i = w^\star + \sigma \epsilon_i$, $i = 1, \dots, n$, is available, where the $\epsilon_i \sim \mathcal{N}(0, I_p)$ are iid, and let $y = \frac{1}{n} \sum_{i=1}^n x_i$ denote their average. [5] prove in Proposition 4 that
$$\hat{w} = \arg\min_w \|w - y\|_2 \quad \text{s.t.} \quad f(w) \le f(w^\star)$$
satisfies $\mathbb{E}\|\hat{w} - w^\star\|_2^2 \le \frac{\sigma^2}{n} S(w^\star, f)$.

Demixing. Given $y = U w^\star + v^\star$, with $U \in \mathbb{R}^{p \times p}$ orthogonal and $w^\star, v^\star \in \mathbb{R}^p$, [1] proved in Theorem III that the pair $(\hat{w}, \hat{v})$ of solutions of $\min_{w,v} f(w)$ s.t.
$g(v) \le g(v^\star)$ and $y = Uw + v$ equals $(w^\star, v^\star)$ with probability $1 - \eta$, provided that $S(w^\star, f) + S(v^\star, g) \le p - 4\sqrt{p \log \frac{4}{\eta}}$. Conversely, if $S(w^\star, f) + S(v^\star, g) \ge p + 4\sqrt{p \log \frac{4}{\eta}}$, the demixing succeeds with probability at most $\eta$.

3.3 Statistical dimensions of the suggested regularizers

We will focus on computing the statistical dimension of the suggested norms at atoms. This has two principal motivations. The first is to derive a number of potential corollaries from our results for potential applications of this regularizer. The second is the comparison of the statistical dimension of the suggested norms to those of related functions. The comparison provides a better understanding of the geometry of the problem and
locates it next to better-known objects such as the trace norm or the CUT polytope. The analogy with the vector case is particularly surprising: knowledge of the number of nonzeros does not allow to gain an order of magnitude in the vector case, whereas the gain is significant in the matrix case; see Section 3.4. Let us start with a simple result showing that the $(k, q)$-trace norm is at least as statistically powerful as any convex combination of the $\ell_1$ and trace norms. By restriction to the PSD cone, this result shows the statistical superiority of the suggested norm over the more standard SDP relaxations of the sparse PCA problem [7].

Lemma 5 Let $\mathcal{A} \subset \mathcal{B}$ and let $\|\cdot\|_{\mathcal{A}}$ and $\|\cdot\|_{\mathcal{B}}$ denote the gauges associated with the convex hulls of the corresponding sets. For any $z \in \mathcal{A}$, we have
1. the inclusion of normal cones $N_{\mathcal{B}}(z) \subset N_{\mathcal{A}}(z)$;
2. for any $g \in \mathbb{R}^p$, $\mathrm{dist}(g, N_{\mathcal{A}}(z)) \le \mathrm{dist}(g, N_{\mathcal{B}}(z))$;
3. $S(z, \|\cdot\|_{\mathcal{A}}) \le S(z, \|\cdot\|_{\mathcal{B}})$.

The following result shows that $\Omega_{k,q}$ is always at least as good on $\mathcal{A}^0_{k,q}$ as the standard $\ell_1$ + trace norm penalty popular for estimating sparse low-rank matrices [18].

Proposition 2 Fix a pair of real numbers $\lambda, \nu > 0$ and let $f$ denote the function $Z \mapsto \lambda \|Z\|_1 + \nu \|Z\|_*$. The normal cone of $\Omega_{k,q}$ at $A = ab^\top \in \mathcal{A}^0_{k,q}$ contains the normal cone of $f$ and is contained in the normal cone of the $(k, q)$-CUT norm at $A$:
$$N_f(A) \subset N_{\Omega_{k,q}}(A) \subset N_{(k,q)\text{-CUT}}(A).$$
The proof can be found in the appendix. We recall the following lower bound, of order $\min\{kq,\, m_1 + m_2\}$, on the statistical dimension of the sum of the $\ell_1$ and trace norms, before moving on to upper bounds on the statistical dimensions of the $(k, q)$-trace and $(k, q)$-CUT norms, which improve on it by achieving asymptotic rates of order $k \log m_1 + q \log m_2$.

Proposition 3 ([15, Theorem 3.3]) Let $\lambda, \mu > 0$ and consider $\nu : Z \mapsto \lambda \|Z\|_1 + \mu \|Z\|_*$. Then for every $A \in \mathcal{A}_{k,q}$ there exists a constant $C > 0$ such that $S(A, \nu) \ge C \min\{kq,\, m_1 + m_2\}$, and this bound cannot be improved in terms of scaling in $k$, $q$, $m_1$, $m_2$.
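The statistical dimension is easy to estimate by Monte Carlo when the normal cone has a simple description. For the $\ell_1$ norm at a $k$-sparse point, the normal cone is $\{z : z_i = t\,\mathrm{sign}(w_i) \text{ on the support},\ |z_j| \le t \text{ off it},\ t \ge 0\}$, and the distance to it can be minimized over $t$ on a grid. A rough sketch of ours:

```python
import numpy as np

def dist2_normalcone_l1(g, w):
    # Squared distance of g to the normal cone of ||.||_1 at w, minimizing
    # over the scale t on a grid (coarse, but fine for an estimate).
    S = w != 0
    ts = np.linspace(0.0, np.abs(g).max() + 1.0, 400)
    vals = [np.sum((g[S] - t * np.sign(w[S])) ** 2)
            + np.sum(np.maximum(np.abs(g[~S]) - t, 0.0) ** 2) for t in ts]
    return min(vals)

rng = np.random.default_rng(0)
p, k = 100, 5
w = np.zeros(p); w[:k] = 1.0
est = np.mean([dist2_normalcone_l1(rng.standard_normal(p), w)
               for _ in range(200)])

# The estimate sits in the k log(p/k) regime of Table 2.
print(est < 2 * k * np.log(p / k) + 5 * k)  # True
```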
Before turning to more advanced results, let us introduce the following coefficient, which quantifies how close a point $ab^\top \in \mathcal{A}_{k,q}$ is to the subset $\mathcal{A}^0_{k,q}$.

Definition 5 (Dispersion coefficient) Let $A = ab^\top \in \mathcal{A}_{k,q}$ and let $I_0 = \mathrm{supp}(a)$ and $J_0 = \mathrm{supp}(b)$. We define the dispersion coefficient $\gamma(a, b) \in (0, \frac{1}{4}]$ as
$$\gamma(a, b) = \frac{1}{4} \min \Big( k \min_{i \in I_0} a_i^2, \;\; q \min_{j \in J_0} b_j^2 \Big),$$
the maximal value $\frac{1}{4}$ being reached for vectors $a$ and $b$ having nonzero entries of the same amplitude, $\frac{1}{\sqrt{k}}$ and $\frac{1}{\sqrt{q}}$ respectively.
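The coefficient is straightforward to compute; a small check (with example vectors of our own) confirms that equal-amplitude atoms attain the maximal value $1/4$:

```python
import numpy as np

def dispersion(a, b):
    # Dispersion coefficient of Definition 5:
    # gamma(a,b) = (1/4) * min(k * min_{i in I0} a_i^2, q * min_{j in J0} b_j^2).
    a_nz, b_nz = a[a != 0], b[b != 0]
    k, q = len(a_nz), len(b_nz)
    return 0.25 * min(k * (a_nz ** 2).min(), q * (b_nz ** 2).min())

# An atom of A^0_{k,q}: equal-amplitude entries, gamma = 1/4.
a = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
b = np.array([1.0, 0.0])
print(round(dispersion(a, b), 12))  # 0.25
```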
Proposition 4 Let $A = ab^\top \in \mathcal{A}_{k,q}$, and let $\gamma = \gamma(a, b)$ denote its dispersion coefficient. The statistical dimension of $\Omega_{k,q}$ at $A$ is bounded by
$$S(A, \Omega_{k,q}) \;\le\; \frac{16(k+q)}{\gamma} + \frac{1}{\gamma}\Big( k \log \frac{m_1}{k} + q \log \frac{m_2}{q} \Big).$$

Corollary 2 In case $A \in \mathcal{A}^0_{k,q}$, we get $\gamma = \frac{1}{4}$, so the statistical dimension is bounded by a factor of the order of $k \log \frac{m_1}{k} + q \log \frac{m_2}{q}$, which fits the rates obtained for the $(k, q)$-CUT norm (see Proposition 6) up to additive terms $k \log k + q \log q \le k \log m_1 + q \log m_2$.

In the vector case, thanks to the closed-form expression (9) of the k-support norm, we can obtain the following bound, which has the same asymptotic rate with smaller constants than the upper bound obtained by simply setting $m_2 = q = 1$ in Proposition 4.

Proposition 5 The statistical dimension of the k-support norm $\theta_k$ at an s-sparse vector $w \in \mathbb{R}^p$ is bounded by
$$S(w, \theta_k) \;\le\; \min\left\{ \frac{5}{4}\, k + \left( \frac{(r+1)\, \theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}} \right)^{2} \log \frac{p}{k}, \;\; \frac{5}{3} \left( \frac{(r+1)\, \theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}} \right)^{2/3} (p - k)^{1/3} \right\},$$
where $|w|_{(i)}$ denotes the $i$-th entry of $w$ sorted in descending order of absolute values. The term $\frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}}$ is the vector analogue of the dispersion coefficient $\gamma(a, b)$ (see Definition 5). We distinguish 4 regimes:

1. In case $w \in \mathcal{A}_{k,1} \setminus \mathcal{A}^0_{k,1}$ is k-sparse with entries not equal in absolute value, $r = 0$ and $\theta_k(w) = \|w\|_2 = 1$. Therefore $\frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}} = \frac{1}{\min_{i \in I_0} |w_i|}$, and the bound becomes
$$S(w, \theta_k) \le \min\left\{ \frac{5}{4}\, k + \frac{1}{\min_{i \in I_0} w_i^2} \log \frac{p}{k}, \;\; \frac{5}{3}\, \frac{(p-k)^{1/3}}{(\min_{i \in I_0} |w_i|)^{2/3}} \right\}.$$

2. In the most favorable case, when all the $s = k$ nonzero elements take the same absolute value, $w \in \mathcal{A}^0_{k,1}$, then $r = k - 1$ takes its maximal value. In this case the norm is proportional to the $\ell_1$ norm and we bound the statistical dimension by $\frac{5}{4} k + k \log(p/k)$, which is exactly the statistical dimension bound of the $\ell_1$ norm at a k-sparse point. Note that the bound is continuous at the atoms: for $w \in \mathcal{A}_{k,1}$, the dispersion-like term $\frac{1}{\min_i w_i^2}$ tends to $k$ as $w$ gets close to $\mathcal{A}^0_{k,1}$, and this fits $\frac{5}{4} k + k \log(p/k)$.
Substituting $m_2 = q = 1$ in Proposition 6 on the $(k, q)$-CUT polytope, we get a term of the same order of magnitude with a larger constant, which we believe is due to the proof techniques used in [6, Theorem 3.9 and Corollary 3.4]. The order of magnitude, however, cannot be improved, and this ensures the tightness of the bound up to a multiplicative factor. In Figure 1 one can distinguish the points having coordinates with the same absolute value at the intersection of the three unit ball edges, represented in green, red and blue. The tightness between the tangent cones of the $\ell_1$ and k-support norms on $\mathcal{A}^0_{k,1}$ can be clearly understood in this figure.

3. An intermediary regime exists where $\sum_{i=k-r}^{s} |w|_{(i)} / \theta_k(w)$ is of the order of $\sqrt{p}$. In this case we get an asymptotic behavior in $O\Big( \big( \frac{(r+1)\,\theta_k(w)}{\sum_{i=k-r}^{s} |w|_{(i)}} \big)^{2/3} (p-k)^{1/3} \Big)$.

4. The other extreme case, which is the least favorable, occurs when $r = 0$ and $\theta_k(w) / \sum_{i=k}^{s} |w|_{(i)}$ is much larger than $\sqrt{p}$; the k-support norm is then proportional to the $\ell_2$ norm in a neighborhood of the point. This situation happens at points close to canonical elements, $e_1$ for instance, where the norm is differentiable. In these cases the Gaussian width is trivially bounded by $p$.
Table 2: Order of magnitude of the statistical dimension of various norms at atoms located in the subset $\mathcal{A}^0_{k,q}$. The $\ell_1$ norm for matrices is the element-wise $\ell_1$ norm, and the trace norm in the vector case ($p = m_1$, $m_2 = 1$) reduces to the $\ell_2$ norm. The column $k = \sqrt{m}$ gives the orders of magnitude in the case of the planted clique problem, where $m_1 = m_2 = m$ and the clique size is set to $k = q = \sqrt{m}$.

| Matrix norm | Statistical dimension | $k = \sqrt{m}$ | Vector norm | Statistical dimension |
|---|---|---|---|---|
| $(k,q)$-trace | $k \log \frac{m_1}{k} + q \log \frac{m_2}{q}$ | $\sqrt{m} \log m$ | k-support | $k \log \frac{p}{k}$ |
| $(k,q)$-CUT | $k \log \frac{m_1}{k} + q \log \frac{m_2}{q}$ | $\sqrt{m} \log m$ | $\kappa_k$ | $k \log \frac{p}{k}$ |
| $\ell_1$ | $kq \log \frac{m_1 m_2}{kq}$ | $m \log m$ | $\ell_1$ | $k \log \frac{p}{k}$ |
| trace | $m_1 + m_2$ | $m$ | $\ell_2$ | $p$ |
| $\ell_1$ + trace | $\min\{kq,\, m_1 + m_2\}$ | $m$ | elastic net | $k \log \frac{p}{k}$ |
| CUT | $m_1 + m_2$ | $m$ | $\ell_\infty$ | $p$ |

We now study the polyhedral functions defined using the gauge of $\mathcal{A}^0_{k,q}$, which we named the $(k, q)$-CUT norm.

Proposition 6 The statistical dimension of the $(k, q)$-CUT norm at elements of $\mathcal{A}^0_{k,q}$ is bounded by
$$9 \Big( k \log \frac{m_1}{k} + q \log \frac{m_2}{q} + (k+q)(1 + \log 2) \Big).$$
An elementary corollary of Proposition 6 provides the statistical dimension of $\kappa_k$ by setting $q = 1$: for $a \in \mathcal{A}^0_{k,1}$,
$$S(a, \kappa_k) \le 9 k \log \frac{p}{k} + 9(1 + \log 2)\, k.$$
In order to compare this polytope with the CUT polytope (see e.g. [9]), we note that any matrix $A = ab^\top \in \mathcal{A}^0_{k,q}$ lies on a face of the polytope $\frac{1}{\sqrt{kq}}\,\mathrm{CUT}$, where $\mathrm{CUT} = \mathrm{conv}\{ab^\top : a \in \{\pm 1\}^{m_1}, b \in \{\pm 1\}^{m_2}\}$. We can state the following result on the statistical dimension of this norm, obtained by using the same argument as above and by noticing that CUT has $2^{m_1 + m_2 - 1}$ vertices.

Proposition 7 The statistical dimension of the CUT norm at points of $\mathcal{A}^0_{k,q}$ is bounded by $O(m_1 + m_2)$.

3.4 Comparison of the norms' properties

The study of the vector case completes the comparison with a surprising message: the construction of the $(k, q)$-trace norm leads to a significant improvement over its simple rivals (the $\ell_1$ and trace norms), but the k-support norm does not lead to a significant benefit in the vector case. The same holds for the polyhedral norms defined through the $(k, q)$-CUT and $\kappa_k$ norms.
In fact, the bound obtained in Proposition 5 fits the $\ell_1$-norm order of magnitude $k \log \frac{p}{k}$. This was expected (and is also tight), as even the norm induced by the gauge of the convex hull of $\mathcal{A}^0_{k,1}$ in the vector case, namely the $(k, 1)$-CUT norm, leads to the same order of magnitude $k \log \frac{p}{k}$: in contrast with the matrix case, the analogous tight relaxation in the vector case does not lead to an improvement in the statistical dimension rates. We point out that taking $k = \sqrt{m}$ as in the planted clique setting, we obtain a statistical dimension of $\sqrt{m} \log m$, which is small compared to the $O(m)$ rate obtained using the CUT polytope or the trace norm, or to the rates obtained using lifted trace norms [16]. Table 2 gathers the statistical dimensions of the different norms in the most favorable case of $\mathcal{A}^0_{k,q}$ atoms. A representation of the edges of the atom sets in the vector case can be found in Figure 1.
In that figure, the elements of A^0_{k,1} lie at the intersection of the three represented edges, and one clearly sees that the tangent cones coincide at these points, which gives an intuition for the matching orders of magnitude of the statistical dimensions. A last interesting remark is that all the vector norms have closed-form expressions in terms of the vector elements, whereas the matrix norms, with the exception of the element-wise ℓ1 norm, only have variational expressions. This is due to a basic fact: the elements of a vector can be sorted, but there is no analogous ordering of the elements of a matrix. Therefore, among the matrix norms, computing the dual norm in closed form is feasible only for the element-wise ℓ1 norm.

References

[1] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: a geometric theory of phase transitions in convex optimization. Submitted, 2013.
[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. NIPS, 2012.
[3] B. N. Bhaskar, G. Tang, and B. Recht. Atomic norm denoising with applications to line spectral estimation. Preprint, 2012.
[4] E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 2012.
[5] V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110, 2013.
[6] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12, 2012.
[7] A. d'Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9, 2008.
[8] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 2007.
[9] M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer, 1996.
[10] R. Foygel and L. Mackey. Corrupted sensing: novel guarantees for separating structured signals. Preprint, 2013.
[11] G. J. O. Jameson. Summing and Nuclear Norms in Banach Space Theory. Cambridge University Press, 1987.
[12] L. Mackey. Deflation methods for sparse PCA. NIPS, 2008.
[13] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 2010.
[14] S. Oymak and B. Hassibi. On a relation between the minimax denoising and the phase transitions of convex functions. In preparation, 2012.
[15] S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. Submitted, 2012.
[16] E. Richard, F. Bach, and J.-P. Vert. Intersecting singularities for multi-structured estimation. ICML, 2013.
[17] E. Richard, S. Gaiffas, and N. Vayatis. Link prediction in graphs with autoregressive features. NIPS, 2012.
[18] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[19] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[20] V. Vu, J. Cho, J. Lei, and K. Rohe. Fantope projection and selection: a near-optimal convex relaxation of sparse PCA. NIPS, 2013.
[21] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 2009.
[22] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 2006.