Sparse Optimization Lecture: Sparse Recovery Guarantees

Size: px

Start display at page:

Download "Sparse Optimization Lecture: Sparse Recovery Guarantees"

Jade Gwen Berry
5 years ago
Views:

1 Those who complete this lecture will know Sparse Optimization Lecture: Sparse Recovery Guarantees Sparse Optimization Lecture: Sparse Recovery Guarantees Instructor: Wotao Yin Department of Mathematics, UCLA July 2013 Note scriber: Zheng Sun online discussions on piazza.com how to read different recovery guarantees some well-known conditions for exact and stable recovery such as Spark, coherence, RIP, NSP, etc. indeed, we can trust l1-minimization for recovering sparse vectors Instructor: Wotao Yin Department of Mathematics, UCLA July 2013 Note scriber: Zheng Sun online discussions on piazza.com Those who complete this lecture will know how to read different recovery guarantees some well-known conditions for exact and stable recovery such as Spark, coherence, RIP, NSP, etc. indeed, we can trust l 1 -minimization for recovering sparse vectors 1 / 35

2 The basic question of sparse optimization is: Can I trust my model to return an intended sparse quantity? The basic question of sparse optimization is: Can I trust my model to return an intended sparse quantity? That is does my model have a unique solution? (otherwise, different algorithms may return different answers) is the solution exactly equal to the original sparse quantity? if not (due to noise), is the solution a faithful approximate of it? how much effort is needed to numerically solve the model? This lecture provides brief answers to the first three questions. That is does my model have a unique solution? (otherwise, different algorithms may return different answers) is the solution exactly equal to the original sparse quantity? if not (due to noise), is the solution a faithful approximate of it? how much effort is needed to numerically solve the model? This lecture provides brief answers to the first three questions. 2 / 35

3 What this lecture does and does not cover What this lecture does and does not cover It covers basic sparse vector recovery guarantees based on spark What this lecture does and does not cover coherence restricted isometry property (RIP) and null-space property (NSP) as well as both exact and robust recovery guarantees. It does not cover the recovery of matrices, subspaces, etc. Recovery guarantees are important parts of sparse optimization, but they are not the focus of this summer course. It covers basic sparse vector recovery guarantees based on spark coherence restricted isometry property (RIP) and null-space property (NSP) as well as both exact and robust recovery guarantees. It does not cover the recovery of matrices, subspaces, etc. Recovery guarantees are important parts of sparse optimization, but they are not the focus of this summer course. 3 / 35

4 Examples of guarantees Examples of guarantees Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003]) For Ax = b where A R m n has full rank, if x satisfies x 0 1 (1 + 2 µ(a) 1 ), then l1-minimization recovers this x. Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003]) For Ax = b where A R m n has full rank, if x satisfies x 0 1 (1 + 2 µ(a) 1 ), then l 1-minimization recovers this x. Examples of guarantees Theorem (Candes and Tao [2005]) If x is k-sparse and A satisfies the RIP-based condition δ2k + δ3k < 1, then x is the l1-minimizer. Theorem (Zhang [2008]) IF A R m n is a standard Gaussian matrix, then with probability at least 1 exp( c0(n m)) l1-minimization is equivalent to l0-minimization for all x: x 0 < c2 1 m log(n/m) where c0, c1 > 0 are constants independent of m and n. Theorem (Candes and Tao [2005]) If x is k-sparse and A satisfies the RIP-based condition δ 2k + δ 3k < 1, then x is the l 1-minimizer. Theorem (Zhang [2008]) IF A R m n is a standard Gaussian matrix, then with probability at least 1 exp( c 0(n m)) l 1-minimization is equivalent to l 0-minimization for all x: x 0 < c2 1 4 m 1 + log(n/m) where c 0, c 1 > 0 are constants independent of m and n. 4 / 35

5 How to read guarantees Some basic aspects that distinguish different types of guarantees: Recoverability (exact) vs stability (inexact) General A or special A? Universal (all sparse vectors) or instance (certain sparse vector(s))? General optimality? or specific to model / algorithm? Required property of A: spark, RIP, coherence, NSP, dual certificate? If randomness is involved, what is its role? Condition/bound is tight or not? Absolute or in order of magnitude? How to read guarantees How to read guarantees Some basic aspects that distinguish different types of guarantees: Recoverability (exact) vs stability (inexact) General A or special A? Universal (all sparse vectors) or instance (certain sparse vector(s))? General optimality? or specific to model / algorithm? Required property of A: spark, RIP, coherence, NSP, dual certificate? If randomness is involved, what is its role? Condition/bound is tight or not? Absolute or in order of magnitude? 5 / 35

6 Spark Spark First questions for finding the sparsest solution to Ax = b 1. Can sparsest solution be unique? Under what conditions? 2. Given a sparse x, how to verify whether it is actually the sparsest one? First questions for finding the sparsest solution to Ax = b Spark Definition (Donoho and Elad [2003]) The spark of a given matrix A is the smallest number of columns from A that are linearly dependent, written as spark(a). rank(a) is the largest number of columns from A that are linearly independent. In general, spark(a) rank(a) + 1; except for many randomly generated matrices. Rank is easy to compute (due to the matroid structure), but spark needs a combinatorial search. 1. Can sparsest solution be unique? Under what conditions? 2. Given a sparse x, how to verify whether it is actually the sparsest one? Definition (Donoho and Elad [2003]) The spark of a given matrix A is the smallest number of columns from A that are linearly dependent, written as spark(a). rank(a) is the largest number of columns from A that are linearly independent. In general, spark(a) rank(a) + 1; except for many randomly generated matrices. Example: if in A = [a 1,..., a n], a i 0 for all i and a i = a j for some i j, then spark(a) = 2. The definition of spark implies: any set of columns of A will be linearly independent if its cardinality is strictly less than spark(a). spark(a) rank(a) + 1. There are matrices with a small spark but a large rank. Example: Changing one column of an n n non-singular square matrix A into zero, n 1. Then rank(a) = n 1 and spark(a) = 1. Rank is easy to compute (due to the matroid structure), but spark needs a combinatorial search. 6 / 35

7 Spark Spark Theorem (Gorodnitsky and Rao [1997]) If Ax = b has a solution x obeying x 0 < spark(a)/2, then x is the sparsest solution. Theorem (Gorodnitsky and Rao [1997]) If Ax = b has a solution x obeying x 0 < spark(a)/2, then x is the sparsest Spark Proof idea: if there is a solution y to Ax = b and x y 0, then A(x y) = 0 and thus x 0 + y 0 x y 0 spark(a) or y 0 spark(a) x 0 > spark(a)/2 > x 0. The result does not mean this x can be efficiently found numerically. For many random matrices A R m n, the result means that if an algorithm returns x satisfying x 0 < (m + 1)/2, the x is optimal with probability 1. What to do when spark(a) is difficult to obtain? solution. Proof idea: if there is a solution y to Ax = b and x y 0, then A(x y) = 0 and thus x 0 + y 0 x y 0 spark(a) or y 0 spark(a) x 0 > spark(a)/2 > x 0. In A(x y) = 0, view x y as the vector consisting of the linear-combination coefficients. Then, A(x y) = 0 means that the columns in A corresponding to those in the support of x y are linearly dependent, which implies x y 0 spark(a) by definition. Here we assume A R m n with m < n. Therefore rank(a) m, x 0 < spark(a)/2 (m + 1)/2. The result does not mean this x can be efficiently found numerically. For many random matrices A R m n, the result means that if an algorithm returns x satisfying x 0 < (m + 1)/2, the x is optimal with probability 1. What to do when spark(a) is difficult to obtain? 7 / 35

8 General Recovery - Spark General Recovery - Spark Rank is easy to compute, but spark needs a combinatorial search. However, for matrix with entries in general positions, spark(a) = rank(a) + 1. General Recovery - Spark For example, if matrix A R m n (m < n) has entries Aij N (0, 1), then rank(a) = m = spark(a) 1 with probability 1. In general, full rank matrix A R m n (m < n), any m + 1 columns of A is linearly dependent, so spark(a) m + 1 = rank(a) + 1. Rank is easy to compute, but spark needs a combinatorial search. However, for matrix with entries in general positions, spark(a) = rank(a) + 1. For example, if matrix A R m n (m < n) has entries A ij N (0, 1), then rank(a) = m = spark(a) 1 with probability 1. In general, full rank matrix A R m n (m < n), any m + 1 columns of A is linearly dependent, so spark(a) m + 1 = rank(a) + 1. To compute rank(a), we can grow a running subset C k of the columns of A such that A C k have independent columns for k = 1, 2,..., rank(a). Given k < rank(a) and C k, one can obtain C k+1 by finding an unpicked column A that is independent to those in A C k. So, C k is grown monotonically without replacement. However, it is easy to construct a matrix A such that all but one set of spark(a) columns are linearly independent. Therefore, one must enumerate all the combinations of find that set of linearly dependent columns in order to determine spark(a). For many random matrices, spark(a) = rank(a) + 1 is true. 8 / 35

9 Definition (Mallat and Zhang [1993]) Coherence The (mutual) coherence of a given matrix A is the largest absolute normalized inner product between different columns from A. Suppose A = [a 1 a 2 a n]. The mutual coherence of A is given by a k a j µ(a) = max. k,j,k j a k 2 a j 2 It characterizes the dependence between columns of A For unitary matrices, µ(a) = 0 For matrices with more columns than rows, µ(a) > 0 For recovery problems, we desire a small µ(a) as it is similar to unitary matrices. For A = [Φ Ψ] where Φ and Ψ are n n unitary, it holds n 1/2 µ(a) 1 µ(a) = n 1/2 is achieved with [I F], [I Hadamard], etc. if A R m n where n > m, then µ(a) m 1/2. Coherence Definition (Mallat and Zhang [1993]) Coherence The (mutual) coherence of a given matrix A is the largest absolute normalized inner product between different columns from A. Suppose A = [a1 a2 an]. The mutual coherence of A is given by µ(a) = max k,j,k j a k aj. ak 2 aj 2 It characterizes the dependence between columns of A For unitary matrices, µ(a) = 0 For matrices with more columns than rows, µ(a) > 0 For recovery problems, we desire a small µ(a) as it is similar to unitary matrices. For A = [Φ Ψ] where Φ and Ψ are n n unitary, it holds n 1/2 µ(a) 1 µ(a) = n 1/2 is achieved with [I F], [I Hadamard], etc. if A R m n where n > m, then µ(a) m 1/2. µ(a) = cos min k,j,k j (a k, a j ) cos < a k, a j >, the cosine of the minimum angle between two distinct columns of A. A is unitary column vectors are orthonormal the minimum angle is 90 degrees µ(a) = 0. The columns of Φ establish a cartesian coordinate system, to minimize µ(a), columns of Ψ should be far away from Φ. Figure: An illustration for 2-D 9 / 35

10 Coherence Coherence Theorem (Donoho and Elad [2003]) spark(a) 1 + µ 1 (A). Theorem (Donoho and Elad [2003]) Coherence Proof sketch: Ā A with columns normalized to unit 2-norm p spark(a) B a p p minor of Ā Ā Bii = 1 and j i Bij (p 1)µ(A) Suppose p < 1 + µ 1 (A) 1 > (p 1)µ(A) Bii > j i Bij, i B 0 (Gershgorin circle theorem) spark(a) > p. Contradiction. spark(a) 1 + µ 1 (A). Proof sketch: Ā A with columns normalized to unit 2-norm p spark(a) Gershgorin circle theorem: Every eigenvalue of A lies within at least one of the Gershgorin discs D(A ii, Ri). Here, A = (A ij ) n n, R i = i j A ij,d(a ii, R i ) = {x : x A ii R i }. B ii > j i B ij for i=1,2,...n implies that the gram matrix B is diagonally dominant and has no zero eigenvalue. Therefore B is positive definite. B a p p minor of Ā Ā B ii = 1 and j i Bij (p 1)µ(A) Suppose p < 1 + µ 1 (A) 1 > (p 1)µ(A) B ii > j i Bij, i B 0 (Gershgorin circle theorem) spark(a) > p. Contradiction. 10 / 35

11 Coherence-base guarantee Coherence-base guarantee Corollary If Ax = b has a solution x obeying x 0 < (1 + µ 1 (A))/2, then x is the unique sparsest solution. Corollary If Ax = b has a solution x obeying x 0 < (1 + µ 1 (A))/2, then x is the Coherence-base guarantee Compare with the previous Theorem If Ax = b has a solution x obeying x 0 < spark(a)/2, then x is the sparsest solution. For A R m n where m < n, (1 + µ 1 (A)) is at most 1 + m but spark can be 1 + m. spark is more useful. Assume Ax = b has a solution with x 0 = k < spark(a)/2. It will be the unique l0 minimizer. Will it be the l1 minimizer as well? Not necessarily. However, x 0 < (1 + µ 1 (A))/2 is a sufficient condition. unique sparsest solution. Compare with the previous Theorem If Ax = b has a solution x obeying x 0 < spark(a)/2, then x is the sparsest solution. The spark-based condition allow less sparse x (i.e., with more nonzero entries) compared to the one based on (1 + µ 1 (A))/2. The spark-based condition is less restrictive. For A R m n where m < n, (1 + µ 1 (A)) is at most 1 + m but spark can be 1 + m. spark is more useful. Assume Ax = b has a solution with x 0 = k < spark(a)/2. It will be the unique l 0 minimizer. Will it be the l 1 minimizer as well? Not necessarily. However, x 0 < (1 + µ 1 (A))/2 is a sufficient condition. 11 / 35

Coherence-based l 0 = l 1 Coherence-based l0 = l1 Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003]) If A has normalized columns and Ax = b has a solution x satisfying x 0 < 1 ( 1 + µ 1

12 Coherence-based l 0 = l 1 Coherence-based l0 = l1 Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003]) If A has normalized columns and Ax = b has a solution x satisfying x 0 < 1 ( 1 + µ 1 (A) ), 2 Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003]) If A has normalized columns and Ax = b has a solution x satisfying Coherence-based l 0 = l 1 then this x is the unique minimizer with respect to both l0 and l1. Proof sketch: Previously we know x is the unique l0 minimizer; let S := supp(x) Suppose y is the l1 minimizer but not x; we study e := y x e must satisfy Ae = 0 and e 1 2 es 1 A Ae = 0 ej (1 + µ(a)) 1 µ(a) e 1, j the last two points together contradict the assumption Result bottom line: allow x 0 up to O( m) for exact recovery x 0 < 1 2 ( 1 + µ 1 (A) ), then this x is the unique minimizer with respect to both l 0 and l 1. Definition: e S is the subvector of e formed by its entries in the index set S Therefore, we can use basis pursuit min{ x 1 : Ax = b} to find the sparsest solution Ax = b, solving the nonconvex l 0 problem with convex l 1 optimization. Proof sketch: Previously we know x is the unique l 0 minimizer; let S := supp(x) Suppose y is the l 1 minimizer but not x; we study e := y x e must satisfy Ae = 0 and e 1 2 e S 1 A Ae = 0 e j (1 + µ(a)) 1 µ(a) e 1, j the last two points together contradict the assumption Figure: finding the sparsest solution to Ax = b by l 0 and l 1 minimization Result bottom line: allow x 0 up to O( m) for exact recovery 12 / 35

13 The null space of A Definition: x p := ( i xi p) 1/p. The null space of A The null space of A Definition: x p := ( i xi p) 1/p. Lemma: Let 0 < p 1. If (y x) S p > (y x)s p then x p < y p. Proof: Let e := y x. y p p = x + e p p = xs + es p p + e S p p = x p p + ( e S p p es p p) + ( xs + es p p xs p p + es p p). Last term is nonnegative for 0 < p 1. So, a sufficient condition is e S p p > es p p. If the condition holds for 0 < p 1, it also holds for q (0, p]. Definition (null space property NSP(k, γ)). Every nonzero e N (A) satisfies es 1 < γ e S 1 for all index sets S with S k. Lemma: Let 0 < p 1. If (y x) S p > (y x) S p then x p < y p. Proof: Let e := y x. y p p = x + e p p = x S + e S p p + e S p p = x p p + ( e S p p e S p p) + ( x S + e S p p x S p p + e S p p). Last term is nonnegative for 0 < p 1. So, a sufficient condition is e S p p > e S p p. Some details: y p p = y i p = i S y i p + i S y i p = y S p p + y S p p = x S + e S p p + x S + e S p p = x S + e S p p + e S p p x S + e S p p x S p p + e S p p is nonnegative due to the triangle inequality. NSP focuses on l 1 norm. The larger k is, the more entries in e S, and thus the harder for the inequality to hold or the larger γ has to be. If the condition holds for 0 < p 1, it also holds for q (0, p]. Definition (null space property NSP(k, γ)). Every nonzero e N (A) satisfies e S 1 < γ e S 1 for all index sets S with S k. 13 / 35

14 The null space of A Theorem (Donoho and Huo [2001], Gribonval and Nielsen [2003]) Basis pursuit min{ x 1 : Ax = b} uniquely recovers all k-sparse vectors x o from measurements b = Ax o if and only if A satisfies NSP(k, 1). Proof. Sufficiency. Pick any k-sparse vector x o. Let S := supp(x o ) and S = S c. For any non-zero h N (A), we have A(x o + h) = Ax o = b and The null space of A The null space of A Theorem (Donoho and Huo [2001], Gribonval and Nielsen [2003]) Basis pursuit min{ x 1 : Ax = b} uniquely recovers all k-sparse vectors x o from measurements b = Ax o if and only if A satisfies NSP(k, 1). Proof. Sufficiency. Pick any k-sparse vector x o. Let S := supp(x o ) and S = S c. For any non-zero h N (A), we have A(x o + h) = Ax o = b and x 0 + h 1 = x 0 S + hs 1 + h S 1 x 0 S 1 hs 1 + h S 1 (1) = x ( h S 1 hs 1). NSP(k, 1) of A guarantees x 0 + h 1 > x 0 1, so x o is the unique solution. Necessity. The inequality (1) holds with equality if sign(x o S) = sign(hs) and hs has a sufficiently small scale. Therefore, basis pursuit to uniquely recovers all k-sparse vectors x o, NSP(k, 1) is also necessary. We establish b by using an already known k-sparse signal x 0, and then use the basis pursuit to recover a signal x. The theorem claims x = x 0 for all possible k-sparse x 0 if and only if A satisfies NSP(k, 1). x 0 + h 1 = x 0 S + h S 1 + h S 1 x 0 S 1 h S 1 + h S 1 (1) = x ( h S 1 h S 1). NSP(k, 1) of A guarantees x 0 + h 1 > x 0 1, so x o is the unique solution. Necessity. The inequality (1) holds with equality if sign(x o S) = sign(h S) and h S has a sufficiently small scale. Therefore, basis pursuit to uniquely recovers all k-sparse vectors x o, NSP(k, 1) is also necessary. 14 / 35

15 The null space of A The null space of A Another sufficient condition (Zhang [2008]) for x 1 < y 1 is The null space of A x 0 < 1 ( ) 2 y x 1. 4 y x 2 Proof: es 1 S es 2 S e 2 = x 0 e 2. Then, the above sufficient condition y x 1 > 2 (y x)s 1 is given the above inequality. Another sufficient condition (Zhang [2008]) for x 1 < y 1 is ( ) 2 y x 1. x 0 < 1 4 y x 2 Proof: e S 1 S e S 2 S e 2 = x 0 e 2. Then, the above sufficient condition y x 1 > 2 (y x) S 1 is given the above inequality. 15 / 35

16 Theorem (Zhang [2008]) Given x and b = Ax, Null space min x 1 s.t. Ax = b Null space Theorem (Zhang [2008]) Given x and b = Ax, recovers x uniquely if { 1 x 0 < min 4 Comments: Null space min x 1 s.t. Ax = b e 2 1 e 2 2 } : e N (A) \ {0}. We know 1 e 1/ e 2 n for all e 0. The ratio is small for sparse vectors but we want it large, i.e., close to n and away from 1. Fact: in most subspaces, the ratio is away from 1 In particular, Kashin, Garvaev, and Gluskin showed that a randomly drawn (n m)-dimensional subspace V satisfies e 1 e 2 c1 m, e V, e log(n/m) with probability at least 1 exp( c0(n m)), where c0, c1 > 0 are independent of m and n. recovers x uniquely if { 1 e 2 1 x 0 < min 4 e 2 2 Comments: } : e N (A) \ {0}. We know 1 e 1/ e 2 n for all e 0. The ratio is small for sparse vectors but we want it large, i.e., close to n and away from 1. Fact: in most subspaces, the ratio is away from 1 In particular, Kashin, Garvaev, and Gluskin showed that a randomly drawn (n m)-dimensional subspace V satisfies e 1 c 1 m, e V, e 0 e log(n/m) with probability at least 1 exp( c 0(n m)), where c 0, c 1 > 0 are independent of m and n. 16 / 35 The coherence-based theorem allows x 0 up tp O( m), while this theorem allows x 0 O(m/log(n/m)), which is typically larger. Fixing sparsity k and signal dimension n, we obtain the restriction for the number of measurementsm :m O(k log(n/k)), which tells us how many measurements are required for recovery.

17 Null space Null space Theorem (Zhang [2008]) Null space If A R m n is sampled from i.i.d. Gaussian or is any rank-m matrix such that BA = 0 and B R (n m) m is i.i.d. Gaussian, then with probability at least 1 exp( c0(n m)), l1 minimization recovers any sparse x if x 0 < c2 1 4 m 1 + log(n/m), where c0, c1 are positive constants independent of m and n. Theorem (Zhang [2008]) If A R m n is sampled from i.i.d. Gaussian or is any rank-m matrix such that BA = 0 and B R (n m) m is i.i.d. Gaussian, then with probability at least 1 exp( c 0(n m)), l 1 minimization recovers any sparse x if x 0 < c2 1 4 m 1 + log(n/m), where c 0, c 1 are positive constants independent of m and n. 17 / 35

18 Comments on NSP NSP is no longer necessary if for all k-sparse vectors is relaxed. Comments on NSP Comments on NSP NSP is no longer necessary if for all k-sparse vectors is relaxed. NSP is widely used in the proofs of other guarantees. NSP of order 2k is necessary for stable universal recovery. Consider an arbitrary decoder, tractable or not, that returns a vector from the input b = Ax o. If one requires to be stable in the sense x o (Ax o ) 1 < C σ[k](x o ) for all x o and σ[k] is the best k-term approximation error, then it holds hs 1 < C hs c 1, for all non-zero h N (A) and all coordinate sets S with S 2k. See Cohen, Dahmen, and DeVore [2006]. NSP is widely used in the proofs of other guarantees. NSP of order 2k is necessary for stable universal recovery. Consider an arbitrary decoder, tractable or not, that returns a vector from the input b = Ax o. If one requires to be stable in the sense If the decoder is stable, then matrix A must satisfy NSP(2k, C) property. h S 1 < C h S c 1 indicates that the energy of any vector in the null space couldn t be too concentrated. x o (Ax o ) 1 < C σ [k] (x o ) for all x o and σ [k] is the best k-term approximation error, then it holds h S 1 < C h S c 1, for all non-zero h N (A) and all coordinate sets S with S 2k. See Cohen, Dahmen, and DeVore [2006]. 18 / 35

Restricted isometry property (RIP) Restricted isometry property (RIP) Restricted isometry property (RIP) Definition (Candes and Tao [2005]) Matrix A obeys the restricted isometry property (RIP) with

19 Restricted isometry property (RIP) Restricted isometry property (RIP) Restricted isometry property (RIP) Definition (Candes and Tao [2005]) Matrix A obeys the restricted isometry property (RIP) with constant δs if (1 δs) c 2 2 Ac 2 2 (1 + δs) c 2 2 for all s-sparse vectors c. RIP essentially requires that every set of columns with cardinality less than or equal to s behaves like an orthonormal system. Definition (Candes and Tao [2005]) Matrix A obeys the restricted isometry property (RIP) with constant δ s if When δ S = 0, A S is an isometry. for all s-sparse vectors c. (1 δ s) c 2 2 Ac 2 2 (1 + δ s) c 2 2 RIP essentially requires that every set of columns with cardinality less than or equal to s behaves like an orthonormal system. Figure: e 2 = Ae 2 If A doesn t reserve the norm well, two planes may collapse (or nearly collapse) into a single one. Under this measurement, the solution cannot be recovered. Because for each vector on the collapsed image, we have no idea which plane it comes from. 19 / 35 The smaller δ S is, often the quicker the recovering algorithm will be.

20 RIP RIP Theorem (Candes and Tao [2006]) If x is k-sparse and A satisfies δ2k + δ3k < 1, then x is the unique l1 minimizer. Comments: Theorem (Candes and Tao [2006]) If x is k-sparse and A satisfies δ 2k + δ 3k < 1, then x is the unique l 1 minimizer. RIP RIP needs a matrix to be properly scaled the tight RIP constant of a given matrix A is difficult to compute the result is universal for all k-sparse tighter conditions (see next slide) all methods (including l0) require δ2k < 1 for universal recovery; every k-sparse x is unique if δ2k < 1 the requirement can be satisfied by certain A (e.g., whose entries are i.i.d samples following a subgaussian distribution) and lead to exact recovery for x 0 = O(m/ log(m/k)). Comments: RIP needs a matrix to be properly scaled the tight RIP constant of a given matrix A is difficult to compute the result is universal for all k-sparse tighter conditions (see next slide) all methods (including l 0) require δ 2k < 1 for universal recovery; every k-sparse x is unique if δ 2k < 1 the requirement can be satisfied by certain A (e.g., whose entries are i.i.d samples following a subgaussian distribution) and lead to exact recovery for x 0 = O(m/ log(m/k)). 20 / 35

21 More Comments More Comments More Comments (Foucart-Lai) If δ2k+2 < 1, then a sufficiently small p so that lp minimization is guaranteed to recovery any k-sparse x (Candes) δ2k < 2 1 is sufficient (Foucart-Lai) δ2k < 2(3 2)/ is sufficient RIP gives κ(as) (1 + δk)/(1 δk), S k; so δ2k < 2(3 2)/7 gives κ(as) 1.7, S 2m, very well-conditioned. (Mo-Li) δ2k < is sufficient (Cai-Wang-Xu) δk < is sufficient (Cai-Zhang) δk < 1/3 is sufficient and necessary for universal l1 recovery (Foucart-Lai) If δ 2k+2 < 1, then a sufficiently small p so that l p minimization is guaranteed to recovery any k-sparse x (Candes) δ 2k < 2 1 is sufficient (Foucart-Lai) δ 2k < 2(3 2)/ is sufficient RIP gives κ(a S) (1 + δ k )/(1 δ k ), S k; so δ 2k < 2(3 2)/7 gives κ(a S) 1.7, S 2m, very well-conditioned. (Mo-Li) δ 2k < is sufficient (Cai-Wang-Xu) δ k < is sufficient (Cai-Zhang) δ k < 1/3 is sufficient and necessary for universal l 1 recovery 21 / 35

22 Random matrices with RIPs Trivial randomly constructed matrices satisfy RIPs with overwhelming probability. Random matrices with RIPs Random matrices with RIPs Trivial randomly constructed matrices satisfy RIPs with overwhelming probability. Gaussian: Aij N(0, 1/m), x 0 O(m/ log(n/m)) whp, proof is based on applying concentration of measures to the singular values of Gaussian matrices (Szarek-91,Davidson-Szarek-01). Bernoulli: Aij ±1 wp 1/2, x 0 O(m/ log(n/m)) whp, proof is based on applying concentration of measures to the smallest singular value of a subgaussian matrix (Candes-Tao-04,Litvak-Pajor-Rudelson-TomczakJaegermann-04). Fourier ensemble: A C m n is a randomly chosen submatrix of discrete Fourier transform F C n n. Candes-Tao shows x 0 O(m/ log(n) 6 ) whp; Rudelson-Vershynin shows x 0 O(m/ log(n) 4 ); conjectured x 0 O(m/ log(n)).... Gaussian: A ij N(0, 1/m), x 0 O(m/ log(n/m)) whp, proof is based on applying concentration of measures to the singular values of Gaussian matrices (Szarek-91,Davidson-Szarek-01). Bernoulli distribution generates a 0-1 matrix, which in some situations is easy for hardware implementation. Bernoulli: A ij ±1 wp 1/2, x 0 O(m/ log(n/m)) whp, proof is based on applying concentration of measures to the smallest singular value of a subgaussian matrix (Candes-Tao-04,Litvak-Pajor-Rudelson-TomczakJaegermann-04). Fourier ensemble: A C m n is a randomly chosen submatrix of discrete Fourier transform F C n n. Candes-Tao shows x 0 O(m/ log(n) 6 ) whp; Rudelson-Vershynin shows x 0 O(m/ log(n) 4 ); conjectured x 0 O(m/ log(n)) / 35

23 1 k,j n Incoherent Sampling Suppose (Φ, Ψ) is a pair of orthonormal bases of R n. Φ is used for sensing: A is a subset of rows of Φ Ψ is used to sparsely represent x: x = Ψα, α is sparse Definition The coherence between Φ and Ψ is Incoherent Sampling Incoherent Sampling Suppose (Φ, Ψ) is a pair of orthonormal bases of R n. Φ is used for sensing: A is a subset of rows of Φ Ψ is used to sparsely represent x: x = Ψα, α is sparse Definition The coherence between Φ and Ψ is µ(φ, Ψ) = n max φk, ψj Coherence is the largest correlation between any two elements of Φ and Ψ. If Φ and Ψ contains correlated elements, then µ(φ, Ψ) is large Otherwise, µ(φ, Ψ) is small From linear algebra, 1 µ(φ, Ψ) n. Mutual incoherence deals with the case where the signal is not sparse itself but become (nearly) sparse under a proper basis. That is, by changing the basis from I to Ψ, its energy becomes concentrated, and its support becomes small. µ(φ, Ψ) = n max φ k, ψ j 1 k,j n Coherence is the largest correlation between any two elements of Φ and Ψ. If Φ and Ψ contains correlated elements, then µ(φ, Ψ) is large Otherwise, µ(φ, Ψ) is small Figure: Changing the basis From linear algebra, 1 µ(φ, Ψ) n. 23 / 35

24 Incoherent Sampling Compressive sensing requires low coherent pairs. x is sparse under Ψ: x = Ψα, α is sparse x is measured as b Ax x is recovered from min α 1, s.t. AΨα = b Examples: Φ is spike basis φ k (t) = δ(t k) and Ψ is the Fourier basis ψ j(t) = n 1/2 e i 2π jt/n ; then µ(φ, Ψ) = 1, achieving max incoherence. Coherence between noiselets and Haar wavelets is 2. Coherence between noiselets and Baubechies D4 and D8 are 2.2 and 2.9, respectively. Random matrices are largely incoherent with any fixed basis Ψ. Randomly generated and orthonormalized Φ: w.h.p., the coherence between Φ and any fixed Ψ is about 2 log n. Similar results apply to random Gaussian or ±1 matrices. Bottom line: many random matrices are universally incoherent with any fixed Ψ w.h.p. some kind of random circulant matrix is universally incoherent with any fixed Ψ w.h.p. 24 / 35 Incoherent Sampling Incoherent Sampling Compressive sensing requires low coherent pairs. x is sparse under Ψ: x = Ψα, α is sparse x is measured as b Ax x is recovered from min α 1, s.t. AΨα = b Examples: Φ is spike basis φk(t) = δ(t k) and Ψ is the Fourier basis ψj(t) = n 1/2 e i 2π jt/n ; then µ(φ, Ψ) = 1, achieving max incoherence. Coherence between noiselets and Haar wavelets is 2. Coherence between noiselets and Baubechies D4 and D8 are 2.2 and 2.9, respectively. Random matrices are largely incoherent with any fixed basis Ψ. Randomly generated and orthonormalized Φ: w.h.p., the coherence between Φ and any fixed Ψ is about 2 log n. Similar results apply to random Gaussian or ±1 matrices. Bottom line: many random matrices are universally incoherent with any fixed Ψ w.h.p. some kind of random circulant matrix is universally incoherent with any Why do we expect a small µ(φ, Ψ)?(Incoherence) Suppose Φ is given. Take Ψ = Φ. A downsamples from Φ, that is A = PΦ T. Here, P is a 0-1 matrix, with one 1 in each row, and at most one 1 in each column. Thus, AΨα = b PΦ T Ψα = b Since Ψ is an orthnormal matrix, Φ T Ψ = I. we derive Pα = b, which indicates b is downsampled from α directly. If we lose any entries in the sampling, we will lose it forever. Thus P should be a permutation matrix. The signal α could not be compressed. fixed Ψ w.h.p.

25 Incoherent Sampling Incoherent Sampling Theorem (Candes and Romberg [2007]) Fix x and suppose x is k-sparse under basis Ψ with coefficients in uniformly random signs. Select m measurements in the Φ domain uniformly at random. If m O(µ 2 (Φ, Ψ)k log(n)), l1-minimization recovers x with high probability. Theorem (Candes and Romberg [2007]) Fix x and suppose x is k-sparse under basis Ψ with coefficients in uniformly Incoherent Sampling Comments: The result is not universal for all Ψ or all k-sparse x under Ψ. Only guaranteed for nearly all sign sequences x with a fixed support. Why seeing probability? Because there are special signals that are sparse in Ψ yet vanish at most places in the Φ domain. This result allows structured, as opposed to noise-like (random), matrices. Can be seen as an extension to Fourier CS. Bottom line: the smaller the coherence, the fewer the samples required. This matches numerical experience. random signs. Select m measurements in the Φ domain uniformly at random. If m O(µ 2 (Φ, Ψ)k log(n)), l 1-minimization recovers x with high probability. Comments: The result is not universal for all Ψ or all k-sparse x under Ψ. Only guaranteed for nearly all sign sequences x with a fixed support. Why seeing probability? Because there are special signals that are sparse in Ψ yet vanish at most places in the Φ domain. This result allows structured, as opposed to noise-like (random), matrices. Can be seen as an extension to Fourier CS. Bottom line: the smaller the coherence, the fewer the samples required. This matches numerical experience. 25 / 35

26 Robust Recovery Robust Recovery In order to be practically powerful, CS must deal with Robust Recovery nearly sparse signals measurement noise sometimes both Goal: To obtain accurate reconstructions from highly undersampled measurements, or in short, stable recovery. In order to be practically powerful, CS must deal with nearly sparse signals measurement noise sometimes both Goal: To obtain accurate reconstructions from highly undersampled measurements, or in short, stable recovery. 26 / 35

27 Stable l 1 Recovery Consider a sparse x Stable l1 Recovery noisy CS measurements b Ax + z, where z 2 ɛ Apply the BPDN model: min x 1 s.t. Ax b 2 ɛ. Consider a sparse x noisy CS measurements b Ax + z, where z 2 ɛ Stable l 1 Recovery Theorem Assume (some bounds on δk or δ2k). The solution of the BPDN model returns a solution x satisfying for some constant C. x x 2 C ɛ Proof sketch (using an overly sufficient RIP bound): Let e = x x. S = {i : largest 2k x (i)}. One can show e 2 C1 es 2 C2 Ae 2 C ɛ. 1st inequality essentially from es 1 > e S 1, 2nd inequality essentially from the RIP; 3rd inequality essentially from the constraint. Apply the BPDN model: min x 1 s.t. Ax b 2 ɛ. Theorem Assume (some bounds on δ k or δ 2k ). The solution of the BPDN model returns a solution x satisfying x x 2 C ɛ Here, z denotes the unknown noise in the measurements. The accuracy of the recovery is limited by the energy of noise. C 2 Ae 2 = C 2 Ax b (Ax b) 2 C 2 ( z 2 + (Ax b) 2 ) C ɛ. for some constant C. Proof sketch (using an overly sufficient RIP bound): Let e = x x. S = {i : largest 2k x (i) }. One can show e 2 C 1 e S 2 C 2 Ae 2 C ɛ. 1st inequality essentially from e S 1 > e S 1, 2nd inequality essentially from the RIP; 3rd inequality essentially from the constraint. 27 / 35

28 Stable l 1 Recovery Stable l1 Recovery Theorem Assume (some bounds on δk or δ2k). The solution of the BPDN model returns a solution x satisfying x x 2 C ɛ Theorem Assume (some bounds on δ k or δ 2k ). The solution of the BPDN model returns Stable l 1 Recovery for some constant C. Comments: The result is universal and more general than exact recovery; The error bound is order-optimal: knowing supp(x) will give C ɛ at best; x is almost as good as if one knows where the largest k entries are and directly measure them; C depends on k; when k violates the condition and gets too large, x x 2 will blow up. a solution x satisfying x x 2 C ɛ for some constant C. Comments: The result is universal and more general than exact recovery; The error bound is order-optimal: knowing supp(x) will give C ɛ at best; x is almost as good as if one knows where the largest k entries are and directly measure them; C depends on k; when k violates the condition and gets too large, x x 2 will blow up. 28 / 35

29 Stable l 1 Recovery Stable l1 Recovery Consider a nearly sparse x = xk + w, xk is the vector x with all but the largest (in magnitude) k entries set to 0, CS measurements b Ax + z, where z 2 ɛ. Consider a nearly sparse x = x k + w, Stable l 1 Recovery Theorem Assume (some bounds on δk or δ2k). The solution of the BPDN model returns a solution x satisfying x x 2 C ɛ + C k 1/2 x xk 1 for some constants C and C. Proof sketch (using an overly sufficient RIP bound): Similar to the previous one, except x is no longer k-sparse and es 1 > e S 1 is no longer valid. Instead, we get es x xk 1 > e S 1. Then, e 2 C1 e4k 2 + C k 1/2 x xk 1,... x k is the vector x with all but the largest (in magnitude) k entries set to 0, CS measurements b Ax + z, where z 2 ɛ. Theorem Assume (some bounds on δ k or δ 2k ). The solution of the BPDN model returns a solution x satisfying We could decompose x into a k-sparse signal x k and some noise w We improve the stability analysis by dealing with nearly sparse signal. The theorem holds since small entries could never be recovered. for some constants C and C. x x 2 C ɛ + C k 1/2 x x k 1 Proof sketch (using an overly sufficient RIP bound): Similar to the previous one, except x is no longer k-sparse and e S 1 > e S 1 is no longer valid. Instead, we get e S x x k 1 > e S 1. Then, e 2 C 1 e 4k 2 + C k 1/2 x x k 1,... Figure: Some part couldn t be recovered 29 / 35

30 Stable l 1 Recovery Comments on Stable l1 Recovery x x 2 C ɛ + C k 1/2 x xk 1. Stable l 1 Recovery Suppose ɛ = 0; let us focus on the last term: 1. Consider power-law decay signals x (i) C i r, r > 1; 2. Then, x xk 1 C1k r+1 or k 1/2 x xk 1 C1k r+(1/2) ; 3. But even if x = xk, x x 2 = xk x 2 C1k r+(1/2) ; 4. Conclusion: the bound cannot be fundamentally improved. Comments on x x 2 C ɛ + C k 1/2 x x k 1. Suppose ɛ = 0; let us focus on the last term: 1. Consider power-law decay signals x (i) C i r, r > 1; 2. Then, x x k 1 C 1k r+1 or k 1/2 x x k 1 C 1k r+(1/2) ; 3. But even if x = x k, x x 2 = x k x 2 C 1k r+(1/2) ; 4. Conclusion: the bound cannot be fundamentally improved. C ɛ corresponds to the error brought by noise, and C k 1/2 x x k 1 corresponds to the small entries in the signal. Here, we use a special signal the power-law decaying signal, to illustrate the estimate of k 1/2 x x 1 is tight. That is, we could only find better C but cannot improve the order on this signal. 30 / 35

31 Information Theoretic Analysis Information Theoretic Analysis Question: is there an encoding-decoding means that can do fundamentally better than Gaussian A and l1-minimization? Information Theoretic Analysis In math: encoder-decoder pair (A, ), (Ax) x 2 O(k 1/2 σk(x)) holds for k larger than O(m/ log(n/m))? Comments: A can be any matrix, and can be any decoder, tractable or not. Let x xk 1 be called the best-k approximation error, denoted by σk(x) := x xk 1. Question: is there an encoding-decoding means that can do fundamentally better than Gaussian A and l 1-minimization? In math: encoder-decoder pair (A, ), (Ax) x 2 O(k 1/2 σ k (x)) holds for k larger than O(m/ log(n/m))? Random matrix (such as Gaussian matrices) RIP Stable recovery. The relationship between the rows of matrix and the sparsity of the signal is: m O(k log(n/k)) or k O(m/ log(n/m)) Comments: A can be any matrix, and can be any decoder, tractable or not. Let x x k 1 be called the best-k approximation error, denoted by σ k (x) := x x k / 35

32 (A, ) x K Performance of (A, ) where A R m n : Performance of (A, ) where A R m n : Gelfand width: d m (K) = E m(k) := inf (A, ) x K sup (Ax) x 2 inf sup { h 2 : h K Y }. codim(y ) m Cohen, Dahmen, and DeVore [2006]: If set K = K and K + K C 0K, then d m (K) E m(k) C 0d m (K). Gelfand width: d m (K) = Em(K) := inf sup (Ax) x 2 inf sup { h 2 : h K Y }. codim(y ) m Cohen, Dahmen, and DeVore [2006]: If set K = K and K + K C0K, then d m (K) Em(K) C0d m (K). E m(k) measures the biggest error for the best pair of encoder-decoder pair. We intend to show E m(k) has the order of O(m/ log(n/m)). That is: a carefully chosen pair cannot work fundamentally better than (Gaussian A, l 1 -minimization) when expecting a stable recovery. For K is a l 1 ball, the Gelfand width is the smallest radius of the section cut by the linear space whose codimension is no more than m. The smaller the Gelfand width, the less uncertainty there is in the recovery. 32 / 35

33 log(n/m) m Gelfand Width and K = l 1 -ball Kashin, Gluskin, Garnaev: for K = {h R n : h 1 1}, Gelfand Width and K = l 1 -ball Gelfand Width and K = l1-ball Kashin, Gluskin, Garnaev: for K = {h R n : h 1 1}, log(n/m) log(n/m) C1 d m (K) C2 m m. Consequences: 1. KGG means Em(K) 2. we want (Ax) x 2 C k 1/2 σk(x) C k 1/2 x 1; normalizing gives Em(K) C k 1/2. 3. Therefore, k m/ log(n/m). We cannot do better than this. C 1 log(n/m) m d m (K) C 2 log(n/m) m. here means the order of (O( )). Therefore O(m/ log(n/m)) is the loosest restriction for sparsity k, when a stable encoder-decoder pair is expected. Consequences: 1. KGG means E m(k) log(n/m) m 2. we want (Ax) x 2 C k 1/2 σ k (x) C k 1/2 x 1; normalizing gives E m(k) C k 1/2. 3. Therefore, k m/ log(n/m). We cannot do better than this. 33 / 35

34 (Incomplete) References: D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries vis l 1 minimization. Proceedings of the National Academy of Sciences, 100: , R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Transactions on Information Theory, 49(12): , E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51: , Y. Zhang. Theory of compressive sensing via l1-minimization: a non-rip analysis and extensions. Rice University CAAM Technical Report TR08-11, I.F. Gorodnitsky and B.D. Rao. Sparse signal reconstruction from limited data using focuss: A re-weighted minimum norm algorithm. Signal Processing, IEEE Transactions on, 45(3): , S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12): , D. Donoho and X. Huo. Uncertainty principles and ideal atomic decompositions. IEEE Transactions on Information Theory, 47: , A. Cohen, W. Dahmen, and R. A. DeVore. Compressed sensing and best k-term approximation. Submitted., (Incomplete) References: D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries vis l1 minimization. Proceedings of the National Academy of Sciences, 100: , R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Transactions on Information Theory, 49(12): , E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51: , Y. Zhang. Theory of compressive sensing via l1-minimization: a non-rip analysis and extensions. Rice University CAAM Technical Report TR08-11, I.F. Gorodnitsky and B.D. Rao. Sparse signal reconstruction from limited data using focuss: A re-weighted minimum norm algorithm. Signal Processing, IEEE Transactions on, 45(3): , S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12): , D. Donoho and X. Huo. Uncertainty principles and ideal atomic decompositions. IEEE Transactions on Information Theory, 47: , A. Cohen, W. Dahmen, and R. A. DeVore. Compressed sensing and best k-term approximation. Submitted., / 35

35 E. Candes and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies. IEEE Transactions on Information Theory, 52(1): , E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3): , E. Candes and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies. IEEE Transactions on Information Theory, 52(1): , E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3): , / 35

Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees

Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Prof. Emmanuel Candes and Prof. Wotao Yin