Sparse Optimization Lecture: Sparse Recovery Guarantees

Instructor: Wotao Yin, Department of Mathematics, UCLA, July 2013
Note scriber: Zheng Sun
Online discussions on piazza.com

Those who complete this lecture will know
- how to read different recovery guarantees
- some well-known conditions for exact and stable recovery, such as spark, coherence, RIP, and NSP
- that we can indeed trust $\ell_1$-minimization for recovering sparse vectors

The basic question of sparse optimization is: Can I trust my model to return an intended sparse quantity? That is,
- does my model have a unique solution? (otherwise, different algorithms may return different answers)
- is the solution exactly equal to the original sparse quantity?
- if not (due to noise), is the solution a faithful approximation of it?
- how much effort is needed to numerically solve the model?

This lecture provides brief answers to the first three questions.

What this lecture does and does not cover

It covers basic sparse vector recovery guarantees based on
- spark
- coherence
- restricted isometry property (RIP) and null-space property (NSP)
as well as both exact and robust recovery guarantees.

It does not cover the recovery of matrices, subspaces, etc.

Recovery guarantees are important parts of sparse optimization, but they are not the focus of this summer course.

Examples of guarantees

Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003]). For $Ax = b$ where $A \in \mathbb{R}^{m\times n}$ has full rank, if $x$ satisfies $\|x\|_0 < \tfrac{1}{2}\left(1 + \mu(A)^{-1}\right)$, then $\ell_1$-minimization recovers this $x$.

Theorem (Candes and Tao [2005]). If $x$ is $k$-sparse and $A$ satisfies the RIP-based condition $\delta_{2k} + \delta_{3k} < 1$, then $x$ is the $\ell_1$-minimizer.

Theorem (Zhang [2008]). If $A \in \mathbb{R}^{m\times n}$ is a standard Gaussian matrix, then with probability at least $1 - \exp(-c_0(n-m))$, $\ell_1$-minimization is equivalent to $\ell_0$-minimization for all $x$ with
$$\|x\|_0 < \frac{c_1^2}{4}\,\frac{m}{1 + \log(n/m)},$$
where $c_0, c_1 > 0$ are constants independent of $m$ and $n$.

How to read guarantees

Some basic aspects that distinguish different types of guarantees:
- Recoverability (exact) vs. stability (inexact)
- General $A$ or special $A$?
- Universal (all sparse vectors) or instance (certain sparse vector(s))?
- General optimality, or specific to model / algorithm?
- Required property of $A$: spark, RIP, coherence, NSP, dual certificate?
- If randomness is involved, what is its role?
- Condition/bound is tight or not? Absolute or in order of magnitude?

Spark

First questions for finding the sparsest solution to $Ax = b$:
1. Can the sparsest solution be unique? Under what conditions?
2. Given a sparse $x$, how can we verify whether it is actually the sparsest one?

Definition (Donoho and Elad [2003]). The spark of a given matrix $A$ is the smallest number of columns from $A$ that are linearly dependent, written as $\mathrm{spark}(A)$.

Recall that $\mathrm{rank}(A)$ is the largest number of columns from $A$ that are linearly independent. In general, $\mathrm{spark}(A) \le \mathrm{rank}(A) + 1$, with equality for many randomly generated matrices.

Example: if $A = [a_1, \ldots, a_n]$ with $a_i \neq 0$ for all $i$ and $a_i = a_j$ for some $i \neq j$, then $\mathrm{spark}(A) = 2$.

Notes: The definition of spark implies that any set of columns of $A$ is linearly independent if its cardinality is strictly less than $\mathrm{spark}(A)$. There are matrices with a small spark but a large rank. Example: change one column of an $n\times n$ non-singular matrix $A$ into zero, $n > 1$; then $\mathrm{rank}(A) = n-1$ and $\mathrm{spark}(A) = 1$.

Rank is easy to compute (due to the matroid structure), but spark needs a combinatorial search.
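As a quick illustration (not part of the original slides), here is a brute-force computation of $\mathrm{spark}(A)$ directly from the definition; the helper name `spark` is ours, and the enumeration is only feasible for very small matrices.

```python
# A minimal sketch: brute-force spark(A) by enumerating column subsets.
import itertools
import numpy as np

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (n+1 if none exist)."""
    m, n = A.shape
    for size in range(1, n + 1):
        for cols in itertools.combinations(range(n), size):
            if np.linalg.matrix_rank(A[:, cols], tol=tol) < size:
                return size
    return n + 1  # all columns are linearly independent (only possible if m >= n)

A = np.array([[1., 0., 1., 2.],
              [0., 1., 1., 2.]])
print(spark(A))                                    # 2: the last two columns are parallel
G = np.random.randn(3, 6)
print(spark(G), np.linalg.matrix_rank(G) + 1)      # typically 4 = rank + 1
```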

Spark

Theorem (Gorodnitsky and Rao [1997]). If $Ax = b$ has a solution $x$ obeying $\|x\|_0 < \mathrm{spark}(A)/2$, then $x$ is the sparsest solution.

Proof idea: if there is another solution $y$ to $Ax = b$ with $x \neq y$, then $A(x - y) = 0$ and thus
$$\|x\|_0 + \|y\|_0 \ge \|x - y\|_0 \ge \mathrm{spark}(A), \quad\text{so}\quad \|y\|_0 \ge \mathrm{spark}(A) - \|x\|_0 > \mathrm{spark}(A)/2 > \|x\|_0.$$

Notes: In $A(x - y) = 0$, view $x - y$ as the vector of linear-combination coefficients. Then $A(x-y) = 0$ means that the columns of $A$ indexed by the support of $x - y$ are linearly dependent, which implies $\|x - y\|_0 \ge \mathrm{spark}(A)$ by definition. Here we assume $A \in \mathbb{R}^{m\times n}$ with $m < n$; therefore $\mathrm{rank}(A) \le m$ and $\|x\|_0 < \mathrm{spark}(A)/2 \le (m+1)/2$.

The result does not mean this $x$ can be efficiently found numerically. For many random matrices $A \in \mathbb{R}^{m\times n}$, the result means that if an algorithm returns $x$ satisfying $\|x\|_0 < (m+1)/2$, then $x$ is optimal with probability 1.

What to do when $\mathrm{spark}(A)$ is difficult to obtain?

General Recovery - Spark

Rank is easy to compute, but spark needs a combinatorial search. However, for a matrix with entries in general position, $\mathrm{spark}(A) = \mathrm{rank}(A) + 1$. For example, if $A \in \mathbb{R}^{m\times n}$ ($m < n$) has entries $A_{ij} \sim \mathcal{N}(0,1)$, then $\mathrm{rank}(A) = m = \mathrm{spark}(A) - 1$ with probability 1. In general, for a full-rank matrix $A \in \mathbb{R}^{m\times n}$ ($m < n$), any $m+1$ columns of $A$ are linearly dependent, so $\mathrm{spark}(A) \le m + 1 = \mathrm{rank}(A) + 1$.

Notes: To compute $\mathrm{rank}(A)$, we can grow a running subset $C_k$ of the columns of $A$ such that $A_{C_k}$ has independent columns for $k = 1, 2, \ldots, \mathrm{rank}(A)$. Given $k < \mathrm{rank}(A)$ and $C_k$, one obtains $C_{k+1}$ by finding an unpicked column of $A$ that is independent of those in $A_{C_k}$; so $C_k$ grows monotonically without replacement. However, it is easy to construct a matrix $A$ such that all but one set of $\mathrm{spark}(A)$ columns are linearly independent; therefore, one must enumerate all the combinations to find that one set of linearly dependent columns in order to determine $\mathrm{spark}(A)$. For many random matrices, $\mathrm{spark}(A) = \mathrm{rank}(A) + 1$ holds.
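The greedy column-growing procedure for the rank described in the notes can be sketched as follows (an illustrative addition, not from the slides; the function name `greedy_rank` is ours).

```python
# A minimal sketch of the greedy column-growing procedure for rank(A):
# keep adding columns that are linearly independent of the ones already picked.
import numpy as np

def greedy_rank(A, tol=1e-10):
    m, n = A.shape
    picked = []                                       # the running subset C_k
    for j in range(n):
        cand = A[:, picked + [j]]
        if np.linalg.matrix_rank(cand, tol=tol) == len(picked) + 1:
            picked.append(j)                          # column j is independent of C_k
    return len(picked), picked

A = np.array([[1., 2., 3., 0.],
              [0., 0., 0., 1.]])
print(greedy_rank(A))            # (2, [0, 3]): rank 2, witnessed by columns 0 and 3
```

The greedy choice never needs to be revisited because linearly independent column sets form a matroid, which is exactly the structure the slide refers to; no such exchange property exists for dependent sets, hence the combinatorial search for spark.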

Coherence

Definition (Mallat and Zhang [1993]). The (mutual) coherence of a given matrix $A$ is the largest absolute normalized inner product between different columns of $A$. Writing $A = [a_1\ a_2\ \cdots\ a_n]$, the mutual coherence of $A$ is
$$\mu(A) = \max_{k \neq j} \frac{|a_k^\top a_j|}{\|a_k\|_2\,\|a_j\|_2}.$$

- It characterizes the dependence between columns of $A$.
- For unitary matrices, $\mu(A) = 0$.
- For matrices with more columns than rows, $\mu(A) > 0$.
- For recovery problems, we desire a small $\mu(A)$, since then $A$ behaves similarly to a unitary matrix.
- For $A = [\Phi\ \Psi]$ where $\Phi$ and $\Psi$ are $n\times n$ unitary, it holds that $n^{-1/2} \le \mu(A) \le 1$; $\mu(A) = n^{-1/2}$ is achieved by $[I\ F]$, $[I\ \text{Hadamard}]$, etc.
- If $A \in \mathbb{R}^{m\times n}$ with $n > m$, then $\mu(A) \gtrsim m^{-1/2}$.

Notes: $\mu(A)$ is the cosine of the minimum angle between two distinct columns of $A$. $A$ is unitary $\Leftrightarrow$ its columns are orthonormal $\Leftrightarrow$ the minimum angle is 90 degrees $\Leftrightarrow$ $\mu(A) = 0$. The columns of $\Phi$ establish a Cartesian coordinate system; to minimize $\mu(A)$, the columns of $\Psi$ should be far away from those of $\Phi$.

Figure: An illustration for 2-D
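A small numerical sketch (not from the slides) of the coherence computation; `mutual_coherence` is an illustrative helper, and the example checks the stated value $\mu([I\ F]) = n^{-1/2}$.

```python
# A minimal sketch: mutual coherence mu(A) = max_{k != j} |<a_k,a_j>| / (||a_k|| ||a_j||).
import numpy as np

def mutual_coherence(A):
    An = A / np.linalg.norm(A, axis=0)            # normalize columns to unit 2-norm
    G = np.abs(An.conj().T @ An)                  # absolute Gram matrix
    np.fill_diagonal(G, 0.0)                      # exclude the diagonal (k = j)
    return G.max()

n = 64
F = np.fft.fft(np.eye(n)) / np.sqrt(n)            # unitary DFT matrix
A = np.hstack([np.eye(n).astype(complex), F])     # the concatenation [I F]
print(mutual_coherence(A), 1 / np.sqrt(n))        # both are 0.125
```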

Coherence

Theorem (Donoho and Elad [2003]). $\mathrm{spark}(A) \ge 1 + \mu^{-1}(A)$.

Proof sketch:
- Let $\bar{A}$ be $A$ with columns normalized to unit 2-norm, and let $p = \mathrm{spark}(A)$.
- Let $B$ be any $p\times p$ principal minor of $\bar{A}^\top\bar{A}$ (the Gram matrix of any $p$ columns); then $B_{ii} = 1$ and $\sum_{j\neq i}|B_{ij}| \le (p-1)\mu(A)$.
- Suppose $p < 1 + \mu^{-1}(A)$. Then $1 > (p-1)\mu(A)$, so $B_{ii} > \sum_{j\neq i}|B_{ij}|$ for all $i$, hence $B \succ 0$ (Gershgorin circle theorem). Thus any $p$ columns are linearly independent, i.e., $\mathrm{spark}(A) > p$. Contradiction.

Notes: Gershgorin circle theorem: every eigenvalue of a matrix $M = (M_{ij})_{n\times n}$ lies within at least one of the Gershgorin discs $D(M_{ii}, R_i)$, where $R_i = \sum_{j\neq i}|M_{ij}|$ and $D(M_{ii}, R_i) = \{x : |x - M_{ii}| \le R_i\}$. The condition $B_{ii} > \sum_{j\neq i}|B_{ij}|$ for all $i$ implies that the Gram matrix $B$ is strictly diagonally dominant and has no zero eigenvalue; therefore $B$ is positive definite.

Coherence-based guarantee

Corollary. If $Ax = b$ has a solution $x$ obeying $\|x\|_0 < (1 + \mu^{-1}(A))/2$, then $x$ is the unique sparsest solution.

Compare with the previous theorem: if $Ax = b$ has a solution $x$ obeying $\|x\|_0 < \mathrm{spark}(A)/2$, then $x$ is the sparsest solution.

Notes: The spark-based condition allows less sparse $x$ (i.e., with more nonzero entries) than the one based on $(1 + \mu^{-1}(A))/2$; the spark-based condition is less restrictive. For $A \in \mathbb{R}^{m\times n}$ with $m < n$, $1 + \mu^{-1}(A)$ is at most $1 + \sqrt{m}$, but spark can be as large as $1 + m$; so spark is more useful.

Assume $Ax = b$ has a solution with $\|x\|_0 = k < \mathrm{spark}(A)/2$. It will be the unique $\ell_0$ minimizer. Will it be the $\ell_1$ minimizer as well? Not necessarily. However, $\|x\|_0 < (1 + \mu^{-1}(A))/2$ is a sufficient condition.

Coherence-based $\ell_0 = \ell_1$

Theorem (Donoho and Elad [2003], Gribonval and Nielsen [2003]). If $A$ has normalized columns and $Ax = b$ has a solution $x$ satisfying
$$\|x\|_0 < \tfrac{1}{2}\left(1 + \mu^{-1}(A)\right),$$
then this $x$ is the unique minimizer with respect to both $\ell_0$ and $\ell_1$.

Proof sketch:
- Previously we know $x$ is the unique $\ell_0$ minimizer; let $S := \mathrm{supp}(x)$.
- Suppose $y$ is the $\ell_1$ minimizer but not $x$; we study $e := y - x$.
- $e$ must satisfy $Ae = 0$ and $\|e\|_1 \le 2\|e_S\|_1$.
- $A^\top A e = 0 \Rightarrow |e_j| \le (1 + \mu(A))^{-1}\mu(A)\,\|e\|_1$ for all $j$.
- The last two points together contradict the assumption.

Notes: $e_S$ denotes the subvector of $e$ formed by its entries in the index set $S$. Therefore, we can use basis pursuit, $\min\{\|x\|_1 : Ax = b\}$, to find the sparsest solution of $Ax = b$, solving the nonconvex $\ell_0$ problem by convex $\ell_1$ optimization.

Figure: finding the sparsest solution to $Ax = b$ by $\ell_0$ and $\ell_1$ minimization

Result bottom line: this allows $\|x\|_0$ up to $O(\sqrt{m})$ for exact recovery.
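For illustration (not part of the original lecture), basis pursuit can be written as a linear program and solved with off-the-shelf LP software; the reformulation below with variables $(x, u)$ and scipy.optimize.linprog is one standard way to do it, and the problem sizes are arbitrary.

```python
# A minimal sketch: basis pursuit min{ ||x||_1 : Ax = b } as an LP with z = [x; u],
# -u <= x <= u, so that sum(u) equals ||x||_1 at the optimum.
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    m, n = A.shape
    c = np.hstack([np.zeros(n), np.ones(n)])              # minimize sum(u)
    A_eq = np.hstack([A, np.zeros((m, n))])                # A x = b
    A_ub = np.block([[ np.eye(n), -np.eye(n)],             #  x - u <= 0
                     [-np.eye(n), -np.eye(n)]])            # -x - u <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
                  A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]

rng = np.random.default_rng(0)
m, n, k = 40, 100, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x_hat = basis_pursuit(A, A @ x_true)
print(np.max(np.abs(x_hat - x_true)))    # typically tiny: exact recovery up to solver tolerance
```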

The null space of $A$

Definition: $\|x\|_p := \left(\sum_i |x_i|^p\right)^{1/p}$.

Lemma: Let $0 < p \le 1$ and $S = \mathrm{supp}(x)$. If $\|(y-x)_{\bar S}\|_p^p > \|(y-x)_S\|_p^p$, then $\|x\|_p < \|y\|_p$.

Proof: Let $e := y - x$. Then
$$\|y\|_p^p = \|x + e\|_p^p = \|x_S + e_S\|_p^p + \|e_{\bar S}\|_p^p = \|x\|_p^p + \big(\|e_{\bar S}\|_p^p - \|e_S\|_p^p\big) + \big(\|x_S + e_S\|_p^p - \|x_S\|_p^p + \|e_S\|_p^p\big).$$
The last term is nonnegative for $0 < p \le 1$. So a sufficient condition for $\|x\|_p < \|y\|_p$ is $\|e_{\bar S}\|_p^p > \|e_S\|_p^p$.

If the condition holds for some $0 < p \le 1$, it also holds for every $q \in (0, p]$.

Definition (null space property NSP($k, \gamma$)). Every nonzero $e \in \mathcal{N}(A)$ satisfies $\|e_S\|_1 < \gamma\|e_{\bar S}\|_1$ for all index sets $S$ with $|S| \le k$.

Notes: In detail,
$$\|y\|_p^p = \sum_i |y_i|^p = \sum_{i\in S}|y_i|^p + \sum_{i\notin S}|y_i|^p = \|y_S\|_p^p + \|y_{\bar S}\|_p^p = \|x_S + e_S\|_p^p + \|e_{\bar S}\|_p^p,$$
using $x_{\bar S} = 0$. The term $\|x_S + e_S\|_p^p - \|x_S\|_p^p + \|e_S\|_p^p$ is nonnegative due to the triangle inequality for $\|\cdot\|_p^p$. NSP focuses on the $\ell_1$ norm. The larger $k$ is, the more entries are in $e_S$, and thus the harder it is for the inequality to hold, or the larger $\gamma$ has to be.

The null space of $A$

Theorem (Donoho and Huo [2001], Gribonval and Nielsen [2003]). Basis pursuit $\min\{\|x\|_1 : Ax = b\}$ uniquely recovers all $k$-sparse vectors $x^o$ from the measurements $b = Ax^o$ if and only if $A$ satisfies NSP($k, 1$).

Proof. Sufficiency. Pick any $k$-sparse vector $x^o$. Let $S := \mathrm{supp}(x^o)$ and $\bar S = S^c$. For any nonzero $h \in \mathcal{N}(A)$, we have $A(x^o + h) = Ax^o = b$ and
$$\|x^o + h\|_1 = \|x^o_S + h_S\|_1 + \|h_{\bar S}\|_1 \ge \|x^o_S\|_1 - \|h_S\|_1 + \|h_{\bar S}\|_1 = \|x^o\|_1 + \big(\|h_{\bar S}\|_1 - \|h_S\|_1\big). \qquad (1)$$
NSP($k,1$) of $A$ guarantees $\|x^o + h\|_1 > \|x^o\|_1$, so $x^o$ is the unique solution.

Necessity. The inequality (1) holds with equality if $\mathrm{sign}(x^o_S) = -\mathrm{sign}(h_S)$ and $h_S$ has a sufficiently small scale. Therefore, for basis pursuit to uniquely recover all $k$-sparse vectors $x^o$, NSP($k, 1$) is also necessary.

Notes: We form $b$ from an already known $k$-sparse signal $x^o$ and then use basis pursuit to recover a signal $x^*$. The theorem claims $x^* = x^o$ for all possible $k$-sparse $x^o$ if and only if $A$ satisfies NSP($k, 1$).

The null space of $A$

Another sufficient condition (Zhang [2008]) for $\|x\|_1 < \|y\|_1$ is
$$\|x\|_0 < \frac{1}{4}\left(\frac{\|y - x\|_1}{\|y - x\|_2}\right)^2.$$

Proof: With $e := y - x$ and $S = \mathrm{supp}(x)$, we have $\|e_S\|_1 \le \sqrt{|S|}\,\|e_S\|_2 \le \sqrt{|S|}\,\|e\|_2 = \sqrt{\|x\|_0}\,\|e\|_2$. The sufficient condition $\|y - x\|_1 > 2\|(y-x)_S\|_1$ then follows from the above inequality.

Null space

Theorem (Zhang [2008]). Given $x$ and $b = Ax$,
$$\min \|x\|_1 \quad\text{s.t.}\quad Ax = b$$
recovers $x$ uniquely if
$$\|x\|_0 < \frac{1}{4}\min\left\{\frac{\|e\|_1^2}{\|e\|_2^2} : e \in \mathcal{N}(A)\setminus\{0\}\right\}.$$

Comments:
- We know $1 \le \|e\|_1/\|e\|_2 \le \sqrt{n}$ for all $e \neq 0$. The ratio is small for sparse vectors, but we want it large, i.e., close to $\sqrt{n}$ and away from 1.
- Fact: in most subspaces, the ratio is away from 1. In particular, Kashin, Garnaev, and Gluskin showed that a randomly drawn $(n-m)$-dimensional subspace $V$ satisfies
$$\frac{\|e\|_1}{\|e\|_2} \ge \frac{c_1\sqrt{m}}{\sqrt{1 + \log(n/m)}}, \quad \forall\, e \in V,\ e \neq 0,$$
with probability at least $1 - \exp(-c_0(n-m))$, where $c_0, c_1 > 0$ are independent of $m$ and $n$.

Notes: The coherence-based theorem allows $\|x\|_0$ up to $O(\sqrt{m})$, while this theorem allows $\|x\|_0 \le O(m/\log(n/m))$, which is typically larger. Fixing the sparsity $k$ and signal dimension $n$, we obtain the requirement on the number of measurements $m$: $m \ge O(k\log(n/k))$, which tells us how many measurements are required for recovery.
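The following sketch (an illustrative addition, not from the slides) samples random directions in the null space of a Gaussian matrix to show that the ratio $\|e\|_1/\|e\|_2$ stays well away from 1; note that random sampling only indicates typical behavior, it does not compute the true minimum over $\mathcal{N}(A)$.

```python
# A minimal sketch: ||e||_1 / ||e||_2 for random vectors in N(A), A Gaussian.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
m, n = 50, 200
A = rng.standard_normal((m, n))
V = null_space(A)                               # n x (n - m) orthonormal basis of N(A)

coeff = rng.standard_normal((n - m, 2000))
E = V @ coeff                                   # random vectors in the null space
ratios = np.linalg.norm(E, 1, axis=0) / np.linalg.norm(E, 2, axis=0)
print(ratios.min(), np.sqrt(n))                 # e.g. around 10 vs sqrt(200) ~ 14.1
```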

Null space

Theorem (Zhang [2008]). If $A \in \mathbb{R}^{m\times n}$ is sampled from i.i.d. Gaussian, or is any rank-$m$ matrix such that $AB = 0$ for an i.i.d. Gaussian $B \in \mathbb{R}^{n\times(n-m)}$ (so that $\mathcal{N}(A)$ is spanned by Gaussian vectors), then with probability at least $1 - \exp(-c_0(n-m))$, $\ell_1$ minimization recovers any sparse $x$ with
$$\|x\|_0 < \frac{c_1^2}{4}\,\frac{m}{1 + \log(n/m)},$$
where $c_0, c_1$ are positive constants independent of $m$ and $n$.

Comments on NSP

NSP is no longer necessary if "for all $k$-sparse vectors" is relaxed.

NSP is widely used in the proofs of other guarantees.

NSP of order $2k$ is necessary for stable universal recovery. Consider an arbitrary decoder $\Delta$, tractable or not, that returns a vector from the input $b = Ax^o$. If one requires $\Delta$ to be stable in the sense
$$\|x^o - \Delta(Ax^o)\|_1 < C\,\sigma_{[k]}(x^o) \quad\text{for all } x^o,$$
where $\sigma_{[k]}$ is the best $k$-term approximation error, then it holds that
$$\|h_S\|_1 < C\,\|h_{S^c}\|_1$$
for all nonzero $h \in \mathcal{N}(A)$ and all coordinate sets $S$ with $|S| \le 2k$. See Cohen, Dahmen, and DeVore [2006].

Notes: If the decoder is stable, then the matrix $A$ must satisfy the NSP($2k, C$) property. The condition $\|h_S\|_1 < C\|h_{S^c}\|_1$ indicates that the energy of any vector in the null space cannot be too concentrated.

Restricted isometry property (RIP)

Definition (Candes and Tao [2005]). Matrix $A$ obeys the restricted isometry property (RIP) with constant $\delta_s$ if
$$(1 - \delta_s)\|c\|_2^2 \le \|Ac\|_2^2 \le (1 + \delta_s)\|c\|_2^2$$
for all $s$-sparse vectors $c$.

RIP essentially requires that every set of columns with cardinality less than or equal to $s$ behaves like an orthonormal system.

Notes: When $\delta_s = 0$, $A_S$ is an isometry for every $|S| \le s$. If $A$ does not preserve norms well, two planes may collapse (or nearly collapse) into a single one; under such measurements the solution cannot be recovered, because for each vector in the collapsed image we have no idea which plane it came from. The smaller $\delta_s$ is, often the faster the recovery algorithm will be.

Figure: $\|e\|_2 = \|Ae\|_2$
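As an illustration (not from the slides), the RIP constant of a tiny matrix can be computed by brute force over all supports of size $s$; the helper `rip_constant` below is ours, and the cost grows combinatorially in $s$, matching the remark on the next slide that tight RIP constants are difficult to compute.

```python
# A minimal sketch: delta_s = max over |S|=s of the deviation of eig(A_S^T A_S) from 1.
import itertools
import numpy as np

def rip_constant(A, s):
    m, n = A.shape
    delta = 0.0
    for S in itertools.combinations(range(n), s):
        eigs = np.linalg.eigvalsh(A[:, S].T @ A[:, S])   # ascending eigenvalues
        delta = max(delta, abs(eigs[0] - 1), abs(eigs[-1] - 1))
    return delta

rng = np.random.default_rng(0)
m, n = 20, 30
A = rng.standard_normal((m, n)) / np.sqrt(m)   # columns have unit norm in expectation
print(rip_constant(A, 1), rip_constant(A, 2))  # delta_1 <= delta_2 (monotone in s)
```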

RIP

Theorem (Candes and Tao [2006]). If $x$ is $k$-sparse and $A$ satisfies $\delta_{2k} + \delta_{3k} < 1$, then $x$ is the unique $\ell_1$ minimizer.

Comments:
- RIP needs the matrix to be properly scaled.
- The tight RIP constant of a given matrix $A$ is difficult to compute.
- The result is universal for all $k$-sparse $x$.
- There are tighter conditions (see next slide).
- All methods (including $\ell_0$) require $\delta_{2k} < 1$ for universal recovery; every $k$-sparse $x$ is unique if $\delta_{2k} < 1$.
- The requirement can be satisfied by certain $A$ (e.g., whose entries are i.i.d. samples following a subgaussian distribution) and leads to exact recovery for $\|x\|_0 = O(m/\log(n/k))$.

More Comments

- (Foucart-Lai) If $\delta_{2k+2} < 1$, then there is a sufficiently small $p$ such that $\ell_p$ minimization is guaranteed to recover any $k$-sparse $x$.
- (Candes) $\delta_{2k} < \sqrt{2} - 1$ is sufficient.
- (Foucart-Lai) $\delta_{2k} < 2(3 - \sqrt{2})/7 \approx 0.4531$ is sufficient. RIP gives $\kappa(A_S) \le \sqrt{(1 + \delta_k)/(1 - \delta_k)}$ for $|S| \le k$; so $\delta_{2k} < 2(3-\sqrt{2})/7$ gives $\kappa(A_S) \le 1.7$ for $|S| \le 2k$, very well-conditioned.
- (Mo-Li) $\delta_{2k} < 0.493$ is sufficient.
- (Cai-Wang-Xu) $\delta_k < 0.307$ is sufficient.
- (Cai-Zhang) $\delta_k < 1/3$ is sufficient and necessary for universal $\ell_1$ recovery.

Random matrices with RIPs

Quite simple randomly constructed matrices satisfy RIPs with overwhelming probability.

- Gaussian: $A_{ij} \sim \mathcal{N}(0, 1/m)$; $\|x\|_0 \le O(m/\log(n/m))$ w.h.p.; the proof is based on applying concentration of measure to the singular values of Gaussian matrices (Szarek-91, Davidson-Szarek-01).
- Bernoulli: $A_{ij} = \pm 1$ with probability 1/2 each; $\|x\|_0 \le O(m/\log(n/m))$ w.h.p.; the proof is based on applying concentration of measure to the smallest singular value of a subgaussian matrix (Candes-Tao-04, Litvak-Pajor-Rudelson-Tomczak-Jaegermann-04).
- Fourier ensemble: $A \in \mathbb{C}^{m\times n}$ is a randomly chosen submatrix of the discrete Fourier transform $F \in \mathbb{C}^{n\times n}$. Candes-Tao show $\|x\|_0 \le O(m/\log(n)^6)$ w.h.p.; Rudelson-Vershynin show $\|x\|_0 \le O(m/\log(n)^4)$; it is conjectured that $\|x\|_0 \le O(m/\log(n))$.
- ...

Notes: A Bernoulli matrix has $\pm 1$ entries (up to scaling), which in some situations is easy for hardware implementation.

Incoherent Sampling

Suppose $(\Phi, \Psi)$ is a pair of orthonormal bases of $\mathbb{R}^n$:
- $\Phi$ is used for sensing: $A$ is a subset of rows of $\Phi$;
- $\Psi$ is used to sparsely represent $x$: $x = \Psi\alpha$, where $\alpha$ is sparse.

Definition. The coherence between $\Phi$ and $\Psi$ is
$$\mu(\Phi, \Psi) = \sqrt{n}\,\max_{1\le k, j\le n} |\langle \phi_k, \psi_j\rangle|.$$

Coherence is the largest correlation between any two elements of $\Phi$ and $\Psi$. If $\Phi$ and $\Psi$ contain correlated elements, then $\mu(\Phi, \Psi)$ is large; otherwise, $\mu(\Phi, \Psi)$ is small. From linear algebra, $1 \le \mu(\Phi, \Psi) \le \sqrt{n}$.

Notes: Mutual incoherence deals with the case where the signal is not sparse itself but becomes (nearly) sparse under a proper basis. That is, by changing the basis from $I$ to $\Psi$, its energy becomes concentrated and its support becomes small.

Figure: Changing the basis
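A short numerical sketch (not from the slides) of the basis-coherence formula; `basis_coherence` is an illustrative helper, and the example confirms that the spike/Fourier pair attains $\mu = 1$ while a basis paired with itself attains $\sqrt{n}$.

```python
# A minimal sketch: mu(Phi, Psi) = sqrt(n) * max |<phi_k, psi_j>| for orthonormal bases.
import numpy as np

def basis_coherence(Phi, Psi):
    n = Phi.shape[0]
    return np.sqrt(n) * np.abs(Phi.conj().T @ Psi).max()

n = 64
Phi = np.eye(n, dtype=complex)               # spike basis
Psi = np.fft.fft(np.eye(n)) / np.sqrt(n)     # unitary Fourier basis
print(basis_coherence(Phi, Psi))             # 1.0  (maximal incoherence)
print(basis_coherence(Psi, Psi))             # 8.0 = sqrt(n)  (maximal coherence)
```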

Incoherent Sampling

Compressive sensing requires pairs with low coherence.
- $x$ is sparse under $\Psi$: $x = \Psi\alpha$, $\alpha$ is sparse;
- $x$ is measured as $b = Ax$;
- $x$ is recovered from $\min \|\alpha\|_1$ s.t. $A\Psi\alpha = b$.

Examples:
- $\Phi$ is the spike basis $\phi_k(t) = \delta(t - k)$ and $\Psi$ is the Fourier basis $\psi_j(t) = n^{-1/2} e^{i2\pi jt/n}$; then $\mu(\Phi, \Psi) = 1$, achieving maximal incoherence.
- The coherence between noiselets and Haar wavelets is $\sqrt{2}$; the coherence between noiselets and Daubechies D4 and D8 wavelets is about 2.2 and 2.9, respectively.
- Random matrices are largely incoherent with any fixed basis $\Psi$. For a randomly generated and orthonormalized $\Phi$, w.h.p. the coherence between $\Phi$ and any fixed $\Psi$ is about $\sqrt{2\log n}$. Similar results apply to random Gaussian or $\pm 1$ matrices.

Bottom line: many random matrices are universally incoherent with any fixed $\Psi$ w.h.p.; certain random circulant matrices are also universally incoherent with any fixed $\Psi$ w.h.p.

Notes: Why do we need a small $\mu(\Phi, \Psi)$ (incoherence)? Suppose $\Phi$ is given and take $\Psi = \Phi$. $A$ downsamples from $\Phi$, that is, $A = P\Phi^\top$, where $P$ is a 0-1 matrix with one 1 in each row and at most one 1 in each column. Then $A\Psi\alpha = b$ becomes $P\Phi^\top\Psi\alpha = b$. Since $\Psi$ is orthonormal, $\Phi^\top\Psi = I$, and we get $P\alpha = b$, which means $b$ is downsampled from $\alpha$ directly. If we lose any entries in the sampling, we lose them forever; thus $P$ would have to be a permutation matrix, and the signal $\alpha$ could not be compressed.

Incoherent Sampling

Theorem (Candes and Romberg [2007]). Fix $x$ and suppose $x$ is $k$-sparse under the basis $\Psi$ with coefficients having uniformly random signs. Select $m$ measurements in the $\Phi$ domain uniformly at random. If
$$m \ge O\big(\mu^2(\Phi, \Psi)\, k \log n\big),$$
then $\ell_1$-minimization recovers $x$ with high probability.

Comments:
- The result is not universal for all $\Psi$ or all $k$-sparse $x$ under $\Psi$; it is only guaranteed for nearly all sign sequences $x$ with a fixed support.
- Why is there probability involved? Because there are special signals that are sparse under $\Psi$ yet vanish at most places in the $\Phi$ domain.
- This result allows structured, as opposed to noise-like (random), matrices. It can be seen as an extension of Fourier CS.
- Bottom line: the smaller the coherence, the fewer samples are required. This matches numerical experience.

Robust Recovery

In order to be practically powerful, CS must deal with
- nearly sparse signals,
- measurement noise,
- and sometimes both.

Goal: to obtain accurate reconstructions from highly undersampled measurements, or in short, stable recovery.

Stable $\ell_1$ Recovery

Consider a sparse $x$ and noisy CS measurements $b = Ax + z$, where $\|z\|_2 \le \epsilon$. Apply the BPDN model:
$$\min \|x\|_1 \quad\text{s.t.}\quad \|Ax - b\|_2 \le \epsilon.$$

Theorem. Assume (some bounds on $\delta_k$ or $\delta_{2k}$). The BPDN model returns a solution $x^*$ satisfying
$$\|x^* - x\|_2 \le C\epsilon$$
for some constant $C$.

Proof sketch (using an overly sufficient RIP bound): Let $e = x^* - x$ and let $S$ index the largest $2k$ entries of $|e|$. One can show
$$\|e\|_2 \le C_1\|e_S\|_2 \le C_2\|Ae\|_2 \le C\epsilon.$$
The 1st inequality follows essentially from $\|e_{\bar S}\|_1 \le \|e_S\|_1$, the 2nd essentially from the RIP, and the 3rd essentially from the constraint.

Notes: Here $z$ denotes the unknown noise in the measurements; the accuracy of the recovery is limited by the energy of the noise. For the 3rd inequality,
$$C_2\|Ae\|_2 = C_2\|(Ax^* - b) - (Ax - b)\|_2 \le C_2\big(\|Ax^* - b\|_2 + \|z\|_2\big) \le C\epsilon.$$
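A minimal sketch (not from the lecture) of solving the BPDN model numerically, assuming the cvxpy package as the convex solver; all problem sizes below are arbitrary.

```python
# A minimal sketch of BPDN:  min ||x||_1  s.t.  ||Ax - b||_2 <= eps.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, k, eps = 60, 150, 6, 0.05
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
z = rng.standard_normal(m)
z *= eps / np.linalg.norm(z)                   # noise with ||z||_2 = eps
b = A @ x_true + z

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm(A @ x - b, 2) <= eps])
prob.solve()
print(np.linalg.norm(x.value - x_true))        # small: error of the order of eps
```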

Stable $\ell_1$ Recovery

Theorem. Assume (some bounds on $\delta_k$ or $\delta_{2k}$). The BPDN model returns a solution $x^*$ satisfying $\|x^* - x\|_2 \le C\epsilon$ for some constant $C$.

Comments:
- The result is universal and more general than exact recovery.
- The error bound is order-optimal: even knowing $\mathrm{supp}(x)$ would give $C\epsilon$ at best.
- $x^*$ is almost as good as if one knew where the largest $k$ entries are and measured them directly.
- $C$ depends on $k$; when $k$ violates the condition and gets too large, $\|x^* - x\|_2$ will blow up.

Stable $\ell_1$ Recovery

Consider a nearly sparse $x = x_k + w$, where $x_k$ is the vector $x$ with all but the largest (in magnitude) $k$ entries set to 0, and CS measurements $b = Ax + z$ with $\|z\|_2 \le \epsilon$.

Theorem. Assume (some bounds on $\delta_k$ or $\delta_{2k}$). The BPDN model returns a solution $x^*$ satisfying
$$\|x^* - x\|_2 \le C\epsilon + C' k^{-1/2}\|x - x_k\|_1$$
for some constants $C$ and $C'$.

Proof sketch (using an overly sufficient RIP bound): Similar to the previous one, except that $x$ is no longer $k$-sparse and $\|e_{\bar S}\|_1 \le \|e_S\|_1$ is no longer valid. Instead, we get $\|e_{\bar S}\|_1 \le \|e_S\|_1 + 2\|x - x_k\|_1$. Then $\|e\|_2 \le C_1\|e_{4k}\|_2 + C' k^{-1/2}\|x - x_k\|_1$, ...

Notes: We decompose $x$ into a $k$-sparse signal $x_k$ and a residual $w$, which improves the stability analysis by covering nearly sparse signals. The form of the bound is natural, since entries that are too small can never be recovered.

Figure: Some part could not be recovered

Comments on Stable $\ell_1$ Recovery

$$\|x^* - x\|_2 \le C\epsilon + C' k^{-1/2}\|x - x_k\|_1.$$

Suppose $\epsilon = 0$; let us focus on the last term.
1. Consider power-law decay signals: $|x_{(i)}| \le C i^{-r}$, $r > 1$.
2. Then $\|x - x_k\|_1 \le C_1 k^{-r+1}$, so $k^{-1/2}\|x - x_k\|_1 \le C_1 k^{-r+1/2}$.
3. But even if $x^* = x_k$, $\|x^* - x\|_2 = \|x_k - x\|_2 \approx C_1 k^{-r+1/2}$.
4. Conclusion: the bound cannot be fundamentally improved.

Notes: $C\epsilon$ corresponds to the error caused by the noise, and $C' k^{-1/2}\|x - x_k\|_1$ corresponds to the small entries of the signal. Here we use a special signal, the power-law decaying signal, to illustrate that the estimate $k^{-1/2}\|x - x_k\|_1$ is tight in order: we might find a better constant, but we cannot improve the order on this signal.
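A quick numerical check (an illustrative addition, not from the slides) that for a power-law signal the tail norms decay at the stated rates, so the $k^{-1/2}\|x - x_k\|_1$ term is of the same order as $\|x_k - x\|_2$.

```python
# A minimal sketch: for x_(i) = i^{-r}, the l1 tail scales like k^{1-r} and the
# l2 tail like k^{1/2-r}; the last column stays roughly constant, showing the
# two quantities k^{-1/2}||x - x_k||_1 and ||x - x_k||_2 share the same order.
import numpy as np

r, n = 1.5, 100000
x = np.arange(1, n + 1) ** (-r)             # sorted power-law decay signal
for k in [10, 100, 1000]:
    tail1 = np.sum(x[k:])                   # ||x - x_k||_1
    tail2 = np.sqrt(np.sum(x[k:] ** 2))     # ||x - x_k||_2
    print(k, tail1 * k ** (r - 1), tail2 * k ** (r - 0.5), tail1 / (np.sqrt(k) * tail2))
```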

Information Theoretic Analysis

Question: is there an encoding-decoding scheme that can do fundamentally better than Gaussian $A$ and $\ell_1$-minimization?

In math: is there an encoder-decoder pair $(A, \Delta)$ such that
$$\|\Delta(Ax) - x\|_2 \le O\big(k^{-1/2}\sigma_k(x)\big)$$
holds for $k$ larger than $O(m/\log(n/m))$?

Comments:
- $A$ can be any matrix, and $\Delta$ can be any decoder, tractable or not.
- $\|x - x_k\|_1$ is called the best $k$-term approximation error, denoted $\sigma_k(x) := \|x - x_k\|_1$.

Notes: Random matrices (such as Gaussian matrices) $\Rightarrow$ RIP $\Rightarrow$ stable recovery. The relationship between the number of rows of the matrix and the sparsity of the signal is $m \ge O(k\log(n/k))$, or $k \le O(m/\log(n/m))$.

Performance of $(A, \Delta)$, where $A \in \mathbb{R}^{m\times n}$:
$$E_m(K) := \inf_{(A,\Delta)}\ \sup_{x\in K} \|\Delta(Ax) - x\|_2.$$

Gelfand width:
$$d^m(K) := \inf_{\mathrm{codim}(Y)\le m}\ \sup\{\|h\|_2 : h \in K\cap Y\}.$$

Cohen, Dahmen, and DeVore [2006]: If the set $K$ satisfies $K = -K$ and $K + K \subseteq C_0 K$, then
$$d^m(K) \le E_m(K) \le C_0\, d^m(K).$$

Notes: $E_m(K)$ measures the worst-case error of the best encoder-decoder pair. We will use it to show that a carefully chosen pair cannot work fundamentally better than (Gaussian $A$, $\ell_1$-minimization) when a stable recovery is expected, i.e., that the recoverable sparsity cannot exceed the order $O(m/\log(n/m))$. When $K$ is an $\ell_1$ ball, the Gelfand width is the smallest radius of the section of $K$ cut by a linear subspace of codimension at most $m$. The smaller the Gelfand width, the less uncertainty there is in the recovery.

Gelfand Width and $K = \ell_1$-ball

Kashin, Gluskin, Garnaev: for $K = \{h \in \mathbb{R}^n : \|h\|_1 \le 1\}$,
$$C_1\sqrt{\frac{\log(n/m)}{m}} \le d^m(K) \le C_2\sqrt{\frac{\log(n/m)}{m}}.$$

Consequences:
1. KGG means $E_m(K) \asymp \sqrt{\log(n/m)/m}$.
2. We want $\|\Delta(Ax) - x\|_2 \le C k^{-1/2}\sigma_k(x) \le C k^{-1/2}\|x\|_1$; normalizing gives $E_m(K) \le C k^{-1/2}$.
3. Therefore $k \lesssim m/\log(n/m)$. We cannot do better than this.

Notes: $\asymp$ and $\lesssim$ here are in the sense of order of magnitude ($O(\cdot)$). Therefore $O(m/\log(n/m))$ is the loosest restriction on the sparsity $k$ when a stable encoder-decoder pair is expected.

(Incomplete) References:

D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization. Proceedings of the National Academy of Sciences, 100:2197-2202, 2003.

R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Transactions on Information Theory, 49(12):3320-3325, 2003.

E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203-4215, 2005.

Y. Zhang. Theory of compressive sensing via $\ell_1$-minimization: a non-RIP analysis and extensions. Rice University CAAM Technical Report TR08-11, 2008.

I. F. Gorodnitsky and B. D. Rao. Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 45(3):600-616, 1997.

S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.

D. Donoho and X. Huo. Uncertainty principles and ideal atomic decompositions. IEEE Transactions on Information Theory, 47:2845-2862, 2001.

A. Cohen, W. Dahmen, and R. A. DeVore. Compressed sensing and best k-term approximation. Submitted, 2006.

E. Candes and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies. IEEE Transactions on Information Theory, 52(12):5406-5425, 2006.

E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969-985, 2007.