ELE 538B: Sparsity, Structure and Inference
Compressed Sensing and Sparse Recovery
Yuxin Chen, Princeton University, Spring 2017
Outline
- Restricted isometry property (RIP)
- A RIPless theory
Motivation: wastefulness of data acquisition
Conventional paradigm for data acquisition:
1. Measure full data
2. Compress (by discarding a large fraction of coefficients)
Problem: data is often highly compressible, so most of the acquired data can be thrown away without any perceptual loss.
Blind sensing
Ideally, if we knew a priori which coefficients are worth estimating, we could simply measure those coefficients. Unfortunately, we often have no idea which coefficients are most relevant.
Compressed sensing: compression on the fly
- mimics the behavior of the above ideal situation without pre-computing all coefficients
- often achieved by a random sensing mechanism
"Why go to so much effort to acquire all the data when most of what we get will be thrown away? Can't we just directly measure the part that won't end up being thrown away?"
- David Donoho
Setup: sparse recovery
Recover $x \in \mathbb{R}^p$ given $y = Ax$, where
$A = [a_1, \cdots, a_n]^\top \in \mathbb{R}^{n \times p}$ ($n \ll p$): sampling matrix; $a_i$: sampling vector; $x$: sparse signal
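To make the setup concrete, here is a minimal Python/NumPy sketch that generates a random instance of this recovery problem; the dimensions n, p, k and the Gaussian design are illustrative choices for this note, not mandated by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 400, 5              # n << p: far fewer measurements than unknowns

# k-sparse signal: k nonzero entries at random locations
x = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
x[support] = rng.standard_normal(k)

# random sampling matrix (rows are the sampling vectors a_i) and measurements
A = rng.standard_normal((n, p)) / np.sqrt(n)
y = A @ x                          # y = A x: n linear measurements
```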
Restricted isometry properties
Optimality for $\ell_0$ minimization

minimize$_{x \in \mathbb{R}^p}$ $\|x\|_0$  s.t. $Ax = y$

If instead there were a sparser feasible $\hat{x} \neq x$ with $\|\hat{x}\|_0 \le \|x\|_0 = k$, then
$A(\hat{x} - x) = 0.  (6.1)$
We don't want (6.1) to happen, so we hope
$A(\hat{x} - x) \neq 0$ for all $\hat{x} \neq x$ with $\|\hat{x}\|_0 \le k$, where $\hat{x} - x$ is 2k-sparse.
To simultaneously account for all k-sparse $x$, we hope $A_T$ (for all $|T| \le 2k$) has full column rank, where $A_T$ consists of all columns of $A$ at indices from $T$.
Restricted isometry property (RIP)

Definition 6.1 (Restricted isometry constant). The restricted isometry constant $\delta_k$ of $A$ is the smallest quantity such that
$(1 - \delta_k) \|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_k) \|x\|_2^2  (6.2)$
holds for all k-sparse vectors $x \in \mathbb{R}^p$.

Equivalently, (6.2) says
$\max_{S: |S| = k} \|A_S^* A_S - I_k\| = \delta_k$  (near orthonormality)
where $A_S$ consists of all columns of $A$ at indices from $S$.
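For small p and k, the operator-norm characterization above can be evaluated by brute force; the sketch below does exactly that. The enumeration is exponential in k, which is one reason certifying RIP for a given matrix is hard; the dimensions are illustrative.

```python
import numpy as np
from itertools import combinations

def rip_constant(A, k):
    """Compute delta_k = max over |S|=k of ||A_S^* A_S - I_k|| by enumeration."""
    p = A.shape[1]
    delta = 0.0
    for S in combinations(range(p), k):
        G = A[:, list(S)].T @ A[:, list(S)] - np.eye(k)   # Gram matrix minus I
        delta = max(delta, np.abs(np.linalg.eigvalsh(G)).max())
    return delta

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 12)) / np.sqrt(50)           # tiny toy example
print(rip_constant(A, 2))                                 # estimate of delta_2
```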
Restricted isometry property (RIP) (cont.)

(Homework) For any $x_1, x_2$ that are supported on disjoint subsets $S_1, S_2$ with $|S_1| \le s_1$ and $|S_2| \le s_2$:
$|\langle Ax_1, Ax_2 \rangle| \le \delta_{s_1 + s_2} \|x_1\|_2 \|x_2\|_2  (6.3)$
RIP and $\ell_0$ minimization

minimize$_{x \in \mathbb{R}^p}$ $\|x\|_0$  s.t. $Ax = y$

Fact 6.2 (Exercise). Suppose $x$ is k-sparse. If $\delta_{2k} < 1$, then $\ell_0$-minimization is exact and unique.
RIP and $\ell_1$ minimization

Theorem 6.3. Suppose $x$ is k-sparse. If $\delta_{2k} < \sqrt{2} - 1$, then $\ell_1$-minimization is exact and unique.

- RIP implies success of $\ell_1$ minimization
- A universal result: works simultaneously for all k-sparse signals
- As we will see later, many random designs satisfy this condition with near-optimal sample complexity
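$\ell_1$ minimization is a linear program and can be solved with off-the-shelf LP solvers. Below is a minimal sketch using scipy.optimize.linprog with the standard splitting $x = x^+ - x^-$, $x^+, x^- \ge 0$; the instance sizes are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(A, y):
    """minimize ||x||_1 s.t. Ax = y, as an LP over (x_plus, x_minus) >= 0."""
    n, p = A.shape
    c = np.ones(2 * p)                 # objective: sum(x_plus) + sum(x_minus)
    A_eq = np.hstack([A, -A])          # A x_plus - A x_minus = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(1)
n, p, k = 80, 200, 5
x = np.zeros(p)
x[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((n, p)) / np.sqrt(n)
x_hat = l1_min(A, A @ x)
print(np.linalg.norm(x_hat - x))       # near zero when recovery is exact
```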
Proof of Theorem 6.3
Suppose $x + h$ is feasible and obeys $\|x + h\|_1 \le \|x\|_1$. The goal is to show that $h = 0$ under RIP.
The key is to decompose $h$ into $h_T + h_{T_1} + h_{T_2} + \cdots$:
- $T$: locations of the k largest entries of $x$
- $T_1$: locations of the k largest entries of $h$ in $T^c$
- $T_2$: locations of the k largest entries of $h$ in $(T \cup T_1)^c$
- ...
Proof of Theorem 6.3 (cont.)
The proof proceeds by showing that
1. $h_{T \cup T_1}$ dominates $h_{(T \cup T_1)^c}$ (by the objective function)
2. (converse) $h_{(T \cup T_1)^c}$ dominates $h_{T \cup T_1}$ (by RIP + feasibility)
These can happen simultaneously only when $h$ vanishes.
Proof of Theorem 6.3 (cont.)
Step 1 (depending only on the objective function). Show that
$\sum_{j \ge 2} \|h_{T_j}\|_2 \le \frac{1}{\sqrt{k}} \|h_T\|_1.  (6.4)$
This follows immediately by combining the following two observations:
(i) Since $x + h$ is assumed to be a better estimate:
$\|x\|_1 \ge \|x + h\|_1 = \underbrace{\|x + h_T\|_1}_{\text{since } T \text{ is support of } x} + \|h_{T^c}\|_1 \ge \underbrace{\|x\|_1 - \|h_T\|_1}_{\text{triangle inequality}} + \|h_{T^c}\|_1$
$\implies \|h_{T^c}\|_1 \le \|h_T\|_1  (6.5)$
(ii) Since the entries of $h_{T_{j-1}}$ uniformly dominate those of $h_{T_j}$ ($j \ge 2$):
$\|h_{T_j}\|_2 \le \sqrt{k}\, \|h_{T_j}\|_\infty \le \sqrt{k} \cdot \frac{\|h_{T_{j-1}}\|_1}{k} = \frac{1}{\sqrt{k}} \|h_{T_{j-1}}\|_1$
$\implies \sum_{j \ge 2} \|h_{T_j}\|_2 \le \frac{1}{\sqrt{k}} \sum_{j \ge 2} \|h_{T_{j-1}}\|_1 \le \frac{1}{\sqrt{k}} \|h_{T^c}\|_1  (6.6)$
Proof of Theorem 6.3 (cont.)
Step 2 (using feasibility + RIP). Show that there exists $\rho < 1$ s.t.
$\|h_{T \cup T_1}\|_2 \le \rho \sum_{j \ge 2} \|h_{T_j}\|_2  (6.7)$
If this claim holds, then
$\|h_{T \cup T_1}\|_2 \le \rho \sum_{j \ge 2} \|h_{T_j}\|_2 \overset{(6.4)}{\le} \rho \frac{1}{\sqrt{k}} \|h_T\|_1 \le \rho \frac{1}{\sqrt{k}} \big(\sqrt{k}\, \|h_T\|_2\big) = \rho \|h_T\|_2 \le \rho \|h_{T \cup T_1}\|_2.  (6.8)$
Since $\rho < 1$, we necessarily have $h_{T \cup T_1} = 0$, which together with (6.5) yields $h = 0$.
Proof of Theorem 6.3 (cont.)
We now prove (6.7). To connect $h_{T \cup T_1}$ with $h_{(T \cup T_1)^c}$, we use feasibility:
$Ah = 0 \implies Ah_{T \cup T_1} = -\sum_{j \ge 2} Ah_{T_j},$
which taken collectively with RIP yields
$(1 - \delta_{2k}) \|h_{T \cup T_1}\|_2^2 \le \|Ah_{T \cup T_1}\|_2^2 = -\Big\langle Ah_{T \cup T_1}, \sum_{j \ge 2} Ah_{T_j} \Big\rangle.$
It follows from (6.3) that for all $j \ge 2$,
$|\langle Ah_{T \cup T_1}, Ah_{T_j} \rangle| \le |\langle Ah_T, Ah_{T_j} \rangle| + |\langle Ah_{T_1}, Ah_{T_j} \rangle| \overset{(6.3)}{\le} \delta_{2k} \big(\|h_T\|_2 + \|h_{T_1}\|_2\big) \|h_{T_j}\|_2 \le \sqrt{2}\, \delta_{2k} \|h_{T \cup T_1}\|_2 \|h_{T_j}\|_2,$
which gives
$(1 - \delta_{2k}) \|h_{T \cup T_1}\|_2^2 \le \sum_{j \ge 2} |\langle Ah_{T \cup T_1}, Ah_{T_j} \rangle| \le \sqrt{2}\, \delta_{2k} \|h_{T \cup T_1}\|_2 \sum_{j \ge 2} \|h_{T_j}\|_2.$
This establishes (6.7) with $\rho := \frac{\sqrt{2}\, \delta_{2k}}{1 - \delta_{2k}} < 1$ (or equivalently, $\delta_{2k} < \sqrt{2} - 1$).
Robustness for compressible signals

Theorem 6.4. If $\delta_{2k} < \sqrt{2} - 1$, then the solution $\hat{x}$ to $\ell_1$-minimization obeys
$\|\hat{x} - x\|_2 \lesssim \frac{\|x - x_k\|_1}{\sqrt{k}},$
where $x_k$ is the best k-term approximation of $x$.

- Suppose the l-th largest entry of $x$ is $1/l^\alpha$ for some $\alpha > 1$; then
$\frac{1}{\sqrt{k}} \|x - x_k\|_1 = \frac{1}{\sqrt{k}} \sum_{l > k} l^{-\alpha} \lesssim k^{-\alpha + 0.5} \ll 1$
$\implies \ell_1$-min works well in recovering compressible signals
- Follows similar arguments as in the proof of Theorem 6.3
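The power-law example is easy to check numerically; the following sketch computes $\frac{1}{\sqrt{k}} \|x - x_k\|_1$ for entries $x_l = l^{-\alpha}$ and compares it with the predicted rate $k^{-\alpha + 0.5}$ (the choice $\alpha = 1.5$ and the signal length are illustrative).

```python
import numpy as np

p, alpha = 10**6, 1.5
x = 1.0 / np.arange(1, p + 1) ** alpha     # l-th largest entry is l^(-alpha)
for k in (10, 100, 1000):
    tail = x[k:].sum()                     # ||x - x_k||_1 for this sorted signal
    print(k, tail / np.sqrt(k), k ** (-alpha + 0.5))  # same order of magnitude
```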
Proof of Theorem 6.4
Step 1 (depending only on the objective function). Show that
$\sum_{j \ge 2} \|h_{T_j}\|_2 \le \frac{1}{\sqrt{k}} \|h_T\|_1 + \frac{2}{\sqrt{k}} \|x - x_T\|_1.  (6.9)$
This follows immediately by combining the following two observations:
(i) Since $x + h$ is assumed to be a better estimate:
$\|x_T\|_1 + \|x_{T^c}\|_1 = \|x\|_1 \ge \|x + h\|_1 = \|x_T + h_T\|_1 + \|x_{T^c} + h_{T^c}\|_1 \ge \|x_T\|_1 - \|h_T\|_1 + \|h_{T^c}\|_1 - \|x_{T^c}\|_1$
$\implies \|h_{T^c}\|_1 \le \|h_T\|_1 + 2\|x_{T^c}\|_1  (6.10)$
(ii) Recall from (6.6) that $\sum_{j \ge 2} \|h_{T_j}\|_2 \le \frac{1}{\sqrt{k}} \|h_{T^c}\|_1$.
(The parts that differ from the proof of Theorem 6.3 are the terms involving $x_{T^c}$.)
Proof of Theorem 6.4 (cont.)
Step 2 (using feasibility + RIP). Recall from (6.7) that there exists $\rho < 1$ s.t.
$\|h_{T \cup T_1}\|_2 \le \rho \sum_{j \ge 2} \|h_{T_j}\|_2  (6.11)$
If this claim holds, then
$\|h_{T \cup T_1}\|_2 \le \rho \sum_{j \ge 2} \|h_{T_j}\|_2 \overset{(6.10),(6.6)}{\le} \rho \frac{1}{\sqrt{k}} \big\{\|h_T\|_1 + 2\|x_{T^c}\|_1\big\} \le \rho \frac{1}{\sqrt{k}} \big(\sqrt{k}\, \|h_T\|_2 + 2\|x_{T^c}\|_1\big)$
$= \rho \|h_T\|_2 + \frac{2\rho}{\sqrt{k}} \|x_{T^c}\|_1 \le \rho \|h_{T \cup T_1}\|_2 + \frac{2\rho}{\sqrt{k}} \|x_{T^c}\|_1$
$\implies \|h_{T \cup T_1}\|_2 \le \frac{2\rho}{(1 - \rho)\sqrt{k}} \|x_{T^c}\|_1.  (6.12)$
(The parts that differ from the proof of Theorem 6.3 are the terms involving $x_{T^c}$.)
Proof of Theorem 6.4 (cont.)
Finally, putting the above together yields
$\|h\|_2 \le \|h_{T \cup T_1}\|_2 + \|h_{(T \cup T_1)^c}\|_2 \overset{(6.9)}{\le} \|h_{T \cup T_1}\|_2 + \frac{1}{\sqrt{k}} \|h_T\|_1 + \frac{2}{\sqrt{k}} \|x - x_T\|_1$
$\le \|h_{T \cup T_1}\|_2 + \|h_T\|_2 + \frac{2}{\sqrt{k}} \|x - x_T\|_1 \le 2\|h_{T \cup T_1}\|_2 + \frac{2}{\sqrt{k}} \|x - x_T\|_1$
$\overset{(6.12)}{\le} \frac{2(1 + \rho)}{1 - \rho} \cdot \frac{\|x - x_T\|_1}{\sqrt{k}}$
Which design matrix satisfies RIP?
First example: i.i.d. Gaussian design

Lemma 6.5. A random matrix $A \in \mathbb{R}^{n \times p}$ with i.i.d. $\mathcal{N}(0, \frac{1}{n})$ entries satisfies $\delta_k < \delta$ with high probability, as long as
$n \gtrsim \frac{1}{\delta^2}\, k \log \frac{p}{k}$

This is where non-asymptotic random matrix theory enters.
Gaussian random matrices

Lemma 6.6 (see Vershynin '10). Suppose $B \in \mathbb{R}^{n \times k}$ is composed of i.i.d. $\mathcal{N}(0, 1)$ entries. Then
$\mathbb{P}\Big(\frac{1}{\sqrt{n}} \sigma_{\max}(B) > 1 + \sqrt{k/n} + t\Big) \le e^{-nt^2/2}$
$\mathbb{P}\Big(\frac{1}{\sqrt{n}} \sigma_{\min}(B) < 1 - \sqrt{k/n} - t\Big) \le e^{-nt^2/2}.$

- When $n \gg k$, one has $\frac{1}{n} B^* B \approx I_k$
- Similar results (up to different constants) hold for i.i.d. sub-Gaussian matrices
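Lemma 6.6 is easy to visualize by simulation; the sketch below compares the empirical extreme singular values of $B/\sqrt{n}$ with the predicted edges $1 \pm \sqrt{k/n}$ (dimensions and trial count are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 2000, 100, 50
sv = np.array([np.linalg.svd(rng.standard_normal((n, k)), compute_uv=False)
               for _ in range(trials)]) / np.sqrt(n)
edge = np.sqrt(k / n)                      # predicted spread around 1
print("largest  sigma_max:", sv.max(), " vs 1 + sqrt(k/n) =", 1 + edge)
print("smallest sigma_min:", sv.min(), " vs 1 - sqrt(k/n) =", 1 - edge)
```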
Proof of Lemma 6.5
1. Fix any index subset $S \subseteq \{1, \cdots, p\}$ with $|S| = k$; then $A_S$ (the submatrix of $A$ consisting of columns at indices from $S$) obeys
$\|A_S^* A_S - I_k\| \le O\big(\sqrt{k/n} + t\big)$
with probability exceeding $1 - 2e^{-c_1 n t^2}$, where $c_1 > 0$ is a constant.
2. Taking a union bound over all $S \subseteq \{1, \cdots, p\}$ with $|S| = k$ yields
$\delta_k = \max_{S: |S| = k} \|A_S^* A_S - I_k\| \le O\big(\sqrt{k/n} + t\big)$
with probability exceeding $1 - 2\binom{p}{k} e^{-c_1 n t^2} \ge 1 - 2e^{k \log(ep/k) - c_1 n t^2}$.
Thus, $\delta_k < \delta$ with high probability as long as $n \gtrsim \frac{1}{\delta^2}\, k \log(p/k)$.
Other design matrices that satisfy RIP Random matrices with i.i.d. sub-gaussian entries, as long as n k log(p/k) Random partial DFT matrices with n k log 4 p, where rows of A are independently sampled from rows of DFT matrix F (Rudelson & Vershynin 8) If you have learned entropy method / generic chaining, check out Rudelson & Vershynin 8 and Candes & Plan 11 Compressed sensing 6-24
Other design matrices that satisfy RIP
- Random convolution matrices with $n \gtrsim k \log^4 p$, where the rows of $A$ are independently sampled from the rows of
$$G = \begin{bmatrix} g_0 & g_1 & g_2 & \cdots & g_{p-1} \\ g_{p-1} & g_0 & g_1 & \cdots & g_{p-2} \\ g_{p-2} & g_{p-1} & g_0 & \cdots & g_{p-3} \\ \vdots & & & \ddots & \vdots \\ g_1 & g_2 & g_3 & \cdots & g_0 \end{bmatrix}$$
with $\mathbb{P}(g_i = \pm 1) = 0.5$ (Krahmer, Mendelson & Rauhut '14)
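A sketch of how such a sensing matrix can be generated: draw one random $\pm 1$ sequence g, form the circulant matrix G whose i-th row is g shifted by i, and keep n randomly chosen rows (sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 512, 128
g = rng.choice([-1.0, 1.0], size=p)              # i.i.d. +/-1 symbols
G = np.array([np.roll(g, i) for i in range(p)])  # circulant: row i = g shifted by i
rows = rng.choice(p, size=n, replace=False)      # sample n rows at random
A = G[rows] / np.sqrt(n)                         # normalize so columns have unit norm
```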
RIP guarantees success of many other methods
Example: projected gradient descent (iterative hard thresholding), which alternates between
- gradient descent: $z^t \leftarrow x^t - \mu_t \underbrace{A^* (A x^t - y)}_{\text{gradient of } \frac{1}{2} \|y - Ax\|_2^2}$
- projection: keep only the k largest (in magnitude) entries
Iterative hard thresholding (IHT)

Algorithm 6.1 (Projected gradient descent / iterative hard thresholding)
for $t = 0, 1, \cdots$:
$x^{t+1} = \mathcal{P}_k\big(x^t - \mu_t A^* (A x^t - y)\big)$
where $\mathcal{P}_k(x) := \arg\min_{z: \|z\|_0 \le k} \|z - x\|_2$ is the best k-term approximation of $x$
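A minimal NumPy sketch of Algorithm 6.1, with the constant step size $\mu_t \equiv 1$ used in Theorem 6.7 below; the instance sizes and iteration count are illustrative.

```python
import numpy as np

def iht(A, y, k, iters=50):
    """Iterative hard thresholding: x <- P_k(x - A^*(Ax - y))."""
    p = A.shape[1]
    x = np.zeros(p)
    for _ in range(iters):
        z = x - A.T @ (A @ x - y)            # gradient step with mu_t = 1
        x = np.zeros(p)
        keep = np.argsort(np.abs(z))[-k:]    # indices of the k largest magnitudes
        x[keep] = z[keep]                    # projection P_k: best k-term approx.
    return x

rng = np.random.default_rng(0)
n, p, k = 100, 400, 5
x0 = np.zeros(p)
x0[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((n, p)) / np.sqrt(n)
print(np.linalg.norm(iht(A, A @ x0, k) - x0))  # decays geometrically with iters
```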
Geometric convergence of IHT under RIP

Theorem 6.7 (Blumensath & Davies '09). Suppose $x$ is k-sparse and the RIP constant satisfies $\delta_{3k} < 1/2$. Then taking $\mu_t \equiv 1$ gives
$\|x^t - x\|_2 \le (2\delta_{3k})^t\, \|x^0 - x\|_2$

- Under RIP, IHT attains $\epsilon$-accuracy within $O\big(\log \frac{1}{\epsilon}\big)$ iterations
- Each iteration takes time proportional to a matrix-vector product
- Drawback: needs prior knowledge of k
Numerical performance of IHT
[Figure: relative error $\frac{\|x^t - x\|_2}{\|x\|_2}$ vs. iteration count $t$, for a Gaussian design $A_{i,j} \sim \mathcal{N}(0, 1/n)$ with $k = 5$]
Proof of Theorem 6.7
Let $z := x^t - A^*(A x^t - y) = x^t - A^* A (x^t - x)$. By definition of $\mathcal{P}_k$,
$\|\underbrace{x}_{k\text{-sparse}} - z\|_2^2 \ge \|\underbrace{x^{t+1}}_{\text{best } k\text{-sparse}} - z\|_2^2 = \|x^{t+1} - x\|_2^2 - 2\langle x^{t+1} - x, z - x \rangle + \|z - x\|_2^2,$
which gives
$\|x^{t+1} - x\|_2^2 \le 2\langle x^{t+1} - x, z - x \rangle = 2\langle x^{t+1} - x, (I - A^* A)(x^t - x) \rangle \le 2\delta_{3k} \|x^{t+1} - x\|_2 \|x^t - x\|_2  (6.13)$
$\implies \|x^{t+1} - x\|_2 \le 2\delta_{3k} \|x^t - x\|_2$
as claimed. Here, (6.13) follows from the following fact (homework):
$\langle u, (I - A^* A) v \rangle \le \delta_s \|u\|_2 \|v\|_2$, with $s = |\mathrm{supp}(u) \cup \mathrm{supp}(v)|$
A RIPless theory
Is RIP necessary?
- RIP leads to a universal result holding simultaneously for all k-sparse $x$; universality is often not needed, as we might only care about a particular $x$
- There may be a gap between the regime where RIP holds and the regime in which one has minimal measurements
- Certifying RIP is hard
Can we develop a non-universal, RIPless theory?
A standard recipe
1. Write out the Karush-Kuhn-Tucker (KKT) optimality conditions, which typically involve certain dual variables
2. Construct dual variables satisfying the KKT conditions
Karush-Kuhn-Tucker (KKT) condition
Consider a convex problem
minimize$_x$ $f(x)$  s.t. $Ax - y = 0$
Lagrangian: $\mathcal{L}(x, \nu) := f(x) + \langle \nu, Ax - y \rangle$  ($\nu$: Lagrange multiplier)
If $x$ is an optimizer, then the KKT optimality condition reads
$0 = \nabla_\nu \mathcal{L}(x, \nu)$ and $0 \in \partial_x \mathcal{L}(x, \nu)$
Karush-Kuhn-Tucker (KKT) condition (cont.)
Equivalently, the KKT optimality condition reads
$Ax - y = 0$ and $0 \in \partial f(x) + A^* \nu$  (no constraint on $\nu$)
KKT condition for $\ell_1$ minimization
minimize$_x$ $\|x\|_1$  s.t. $Ax - y = 0$
If $x$ is an optimizer, then the KKT optimality condition reads
$Ax - y = 0$ (naturally satisfied, as $x$ is the truth) and $0 \in \partial \|x\|_1 + A^* \nu$ (no constraint on $\nu$)
$\iff \exists u \in \mathrm{range}(A^*)$ s.t.
$u_i = \mathrm{sign}(x_i)$ if $x_i \neq 0$;  $u_i \in [-1, 1]$ otherwise
This depends only on the signs of the $x_i$'s, irrespective of their magnitudes.
Uniqueness

Theorem 6.8 (a sufficient and almost necessary condition). Let $T := \mathrm{supp}(x)$. Suppose $A_T$ has full column rank. If there exists $u = A^* \nu$ for some $\nu \in \mathbb{R}^n$ s.t.
$u_i = \mathrm{sign}(x_i)$ if $x_i \neq 0$;  $u_i \in (-1, 1)$ otherwise,
then $x$ is the unique solution to $\ell_1$ minimization.

- Only slightly stronger than KKT!
- $\nu$ is said to be a dual certificate; recall that $\nu$ is the Lagrange multiplier
- Finding $\nu$ comes down to solving another convex problem
Geometric interpretation of dual certificate
- When $|u_1| < 1$, the solution is unique
- When $|u_1| = 1$, the solution is non-unique
When we are able to find $u \in \mathrm{range}(A^*)$ s.t. $u_2 = \mathrm{sign}(x_2)$ and $|u_1| < 1$, then $x$ (with $x_1 = 0$) is the unique solution to $\ell_1$ minimization.
Proof of Theorem 6.8
Let $w$ be given by
$w_i = \mathrm{sign}(x_i)$ if $i \in T$ (support of $x$);  $w_i = \mathrm{sign}(h_i)$ otherwise.
If $x + h$ is an optimizer with $h_{T^c} \neq 0$, then
$\|x\|_1 \ge \|x + h\|_1 \ge \|x\|_1 + \langle w, h \rangle = \|x\|_1 + \langle u, h \rangle + \langle w - u, h \rangle$
$= \|x\|_1 + \langle \underbrace{A^* \nu}_{\text{assumption on } u}, h \rangle + \sum_{i \notin T} \big(\mathrm{sign}(h_i) h_i - u_i h_i\big)$
$= \|x\|_1 + \langle \nu, \underbrace{Ah}_{= 0 \ \text{(feasibility)}} \rangle + \sum_{i \notin T} \big(|h_i| - u_i h_i\big) \ge \|x\|_1 + \sum_{i \notin T} (1 - |u_i|) |h_i| > \|x\|_1,$
resulting in a contradiction. Hence $h_{T^c} = 0$. Further, when $h_{T^c} = 0$, one must have $h_T = 0$ from the left-invertibility of $A_T$, and hence $h = h_T + h_{T^c} = 0$.
Constructing dual certificates under Gaussian design
We illustrate how to construct dual certificates for the following setup:
- $x \in \mathbb{R}^p$ is k-sparse
- entries of $A \in \mathbb{R}^{n \times p}$ are i.i.d. standard Gaussian
- sample size obeys $n \gtrsim k \log p$
Constructing dual certificates under Gaussian design (cont.)
Find $\nu \in \mathbb{R}^n$ s.t.
$(A^* \nu)_T = \mathrm{sign}(x_T)  (6.14)$
$|(A^* \nu)_i| < 1, \quad \forall i \notin T  (6.15)$
Step 1: propose a $\nu$ compatible with the linear constraints (6.14). One candidate is the least squares solution:
$\nu = A_T (A_T^* A_T)^{-1} \mathrm{sign}(x_T)$  (explicit expression)
- The LS solution minimizes $\|\nu\|_2$, which will also be helpful when controlling $(A^* \nu)_i$
- From Lemma 6.6, $A_T^* A_T$ is invertible when $n \gtrsim k \log p$
Constructing dual certificates under Gaussian design (cont.)
Step 2: verify (6.15), which amounts to controlling
$\max_{i \notin T} \big\langle \underbrace{A_{:,i}}_{i\text{th column of } A}, \underbrace{A_T (A_T^* A_T)^{-1} \mathrm{sign}(x_T)}_{\nu} \big\rangle$
Since $A_{:,i} \sim \mathcal{N}(0, I_n)$ and $\nu$ are independent for any $i \notin T$, the maximum can be bounded by
$\max_{i \notin T} \langle A_{:,i}, \nu \rangle \lesssim \|\nu\|_2 \sqrt{\log p}$
$\|\nu\|_2 = \|A_T (A_T^* A_T)^{-1} \mathrm{sign}(x_T)\|_2 = \|\underbrace{(A_T^* A_T)^{-1/2}}_{\text{eigenvalues} \asymp 1/\sqrt{n}} \mathrm{sign}(x_T)\|_2 \lesssim \sqrt{k}/\sqrt{n}$
When $n / (k \log p)$ is sufficiently large, $\max_{i \notin T} \langle A_{:,i}, \nu \rangle < 1$.
Exercise: fill in the missing details.
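The two steps above translate directly into a few lines of NumPy. The sketch below builds the least-squares candidate $\nu$ and checks condition (6.15) empirically; the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 1000, 5
T = rng.choice(p, size=k, replace=False)         # support of x
sgn = rng.choice([-1.0, 1.0], size=k)            # sign(x_T)
A = rng.standard_normal((n, p))                  # i.i.d. standard Gaussian entries

# Step 1: least-squares solution of (A^* nu)_T = sign(x_T)
AT = A[:, T]
nu = AT @ np.linalg.solve(AT.T @ AT, sgn)

# Step 2: check |(A^* nu)_i| < 1 off the support
off = np.setdiff1d(np.arange(p), T)
print("max off-support |<A_i, nu>|:", np.abs(A[:, off].T @ nu).max())  # < 1 w.h.p.
```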
More general random sampling
Consider a random design: each sampling vector $a_i$ is independently drawn from a distribution $F$:
$a_i \sim F$
Incoherent sampling:
- Isotropy: $\mathbb{E}[a a^*] = I$ for $a \sim F$; components of $a$ have (i) unit variance and are (ii) uncorrelated
- Incoherence: let $\mu(F)$ be the smallest quantity s.t. for $a \sim F$,
$\|a\|_\infty^2 \le \mu(F)$ with high probability
Incoherence
We want $\mu(F)$ to be small, i.e., the sampling vectors $a_i$ to be dense (spread out)!
What happens if the sampling vectors $a_i$ are sparse? Example: $a_i \sim \mathrm{Uniform}\big(\{\sqrt{p}\, e_1, \cdots, \sqrt{p}\, e_p\}\big)$ is isotropic but maximally incoherent ($\mu(F) = p$): each measurement $y_i = \sqrt{p}\, x_j$ reveals a single coordinate of $x$, so any measurement that misses the support of $x$ carries no information.
Incoherent random sampling
[figure omitted]
A general RIPless theory

Theorem 6.9 (Candes & Plan '11). Suppose $x \in \mathbb{R}^p$ is k-sparse, and the $a_i \overset{\text{ind.}}{\sim} F$ are isotropic. Then $\ell_1$ minimization is exact and unique with high probability, provided that
$n \gtrsim \mu(F)\, k \log p$

- Near-optimal even for highly structured sampling matrices
- Proof idea: produce an (approximate) dual certificate by a clever golfing scheme pioneered by David Gross
Examples of incoherent sampling
- Binary sensing: $\mathbb{P}(a[i] = \pm 1) = 0.5$:
$\mathbb{E}[a a^*] = I, \quad \|a\|_\infty^2 = 1, \quad \mu = 1$
$\implies \ell_1$-min succeeds if $n \gtrsim k \log p$
- Partial Fourier transform: pick a random frequency $f \sim \mathrm{Unif}\big\{0, \frac{1}{p}, \cdots, \frac{p-1}{p}\big\}$ or $f \sim \mathrm{Unif}[0, 1]$, and set $a[i] = e^{j 2\pi f i}$:
$\mathbb{E}[a a^*] = I, \quad \|a\|_\infty^2 = 1, \quad \mu = 1$
$\implies \ell_1$-min succeeds if $n \gtrsim k \log p$; improves upon the RIP-based result ($n \gtrsim k \log^4 p$)
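Isotropy of the partial-Fourier model is easy to verify by a quick Monte Carlo estimate of $\mathbb{E}[a a^*]$; a sketch follows (the dimension and sample count are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 16, 50000
acc = np.zeros((p, p), dtype=complex)
for _ in range(trials):
    f = rng.integers(p) / p                       # f ~ Unif{0, 1/p, ..., (p-1)/p}
    a = np.exp(2j * np.pi * f * np.arange(p))     # a[i] = e^{j 2 pi f i}, |a[i]| = 1
    acc += np.outer(a, a.conj())
print(np.abs(acc / trials - np.eye(p)).max())     # -> 0: E[a a^*] = I, and mu = 1
```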
Examples of incoherent sampling (cont.)
- Random convolution matrices: the rows of $A$ are independently sampled from the rows of
$$G = \begin{bmatrix} g_0 & g_1 & g_2 & \cdots & g_{p-1} \\ g_{p-1} & g_0 & g_1 & \cdots & g_{p-2} \\ g_{p-2} & g_{p-1} & g_0 & \cdots & g_{p-3} \\ \vdots & & & \ddots & \vdots \\ g_1 & g_2 & g_3 & \cdots & g_0 \end{bmatrix}$$
with $\mathbb{P}(g_i = \pm 1) = 0.5$. One has
$\mathbb{E}[a a^*] = I, \quad \|a\|_\infty^2 = 1, \quad \mu = 1$
$\implies \ell_1$-min succeeds if $n \gtrsim k \log p$; improves upon the RIP-based result ($n \gtrsim k \log^4 p$)
A general scheme for dual construction
Find $\nu \in \mathbb{R}^n$ s.t.
$A_T^* \nu = \mathrm{sign}(x_T)  (6.16)$
$\|A_{T^c}^* \nu\|_\infty < 1  (6.17)$
A candidate: the least squares solution w.r.t. (6.16),
$\nu = A_T (A_T^* A_T)^{-1} \mathrm{sign}(x_T)$  (explicit expression)
To verify (6.17), we need to control $\|A_{T^c}^* A_T (A_T^* A_T)^{-1} \mathrm{sign}(x_T)\|_\infty$
- Issue 1: in general, $A_{T^c}$ and $A_T$ are dependent
- Issue 2: $(A_T^* A_T)^{-1}$ is hard to deal with
A general scheme for dual construction (cont.)
Key idea 1: use an iterative scheme to solve
minimize$_\nu$ $\frac{1}{2} \|A_T^* \nu - \mathrm{sign}(x_T)\|_2^2$
for $t = 1, 2, \cdots$:
$\nu^{(t)} = \nu^{(t-1)} - \underbrace{A_T \big(A_T^* \nu^{(t-1)} - \mathrm{sign}(x_T)\big)}_{\text{gradient of } \frac{1}{2} \|A_T^* \nu - \mathrm{sign}(x_T)\|_2^2}$
- Converges to a solution obeying (6.16); no inversion involved
- Issue: complicated dependency across iterations
Golfing scheme (Gross '11)
Key idea 2: sample splitting; use independent samples for each iteration to decouple statistical dependency
Partition $A$ into $L$ row blocks $\underbrace{A^{(1)} \in \mathbb{R}^{n_1 \times p}, \cdots, A^{(L)} \in \mathbb{R}^{n_L \times p}}_{\text{independent}}$
for $t = 1, 2, \cdots$ (stochastic gradient):
$\nu^{(t)} = \nu^{(t-1)} - \mu_t A_T^{(t)} \big(A_T^* \nu^{(t-1)} - \mathrm{sign}(x_T)\big) \in \mathbb{R}^{n_t}$  (but we need $\nu \in \mathbb{R}^n$)
Golfing scheme (Gross 11) Key idea 2: sample splitting use independent samples for each iteration to decouple statistical dependency Partition A into L row blocks A (1) R n1 p,, A (L) R n L p }{{} independent for t = 1, 2, (stochastic gradient) ν (t) = ν (t 1) µ t à (t) ( ) T A T ν (t 1) sign(x T ) where Ã(t) = A (t) R n p is obtained by zero-padding Compressed sensing 6-49
Golfing scheme (Gross '11) (cont.)
$\nu^{(t)} = \nu^{(t-1)} - \mu_t \tilde{A}_T^{(t)} \big(\underbrace{A_T^* \nu^{(t-1)} - \mathrm{sign}(x_T)}_{\text{depends only on } A^{(1)}, \cdots, A^{(t-1)}}\big)$
- Statistical independence across iterations: by construction, $A_T^* \nu^{(t-1)} - \mathrm{sign}(x_T)$ is independent of $A^{(t)}, \cdots, A^{(L)}$ (exercise)
- Each iteration brings us closer to the target (like each golf shot brings us closer to the hole)
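A schematic NumPy rendering of the golfing iteration under the isotropic Gaussian design: the step size $\mu_t = 1/n_t$ is chosen so that each block kills the residual of (6.16) in expectation, and the zero-padded update touches only the coordinates of $\nu$ in block $t$. This is an illustrative sketch, not the tuned construction of Candes & Plan '11.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, L = 1000, 2000, 5, 4
T = rng.choice(p, size=k, replace=False)
sgn = rng.choice([-1.0, 1.0], size=k)              # sign(x_T)
A = rng.standard_normal((n, p))                    # isotropic rows: E[a a^*] = I

nu = np.zeros(n)
blocks = np.array_split(np.arange(n), L)           # independent row blocks A^(1..L)
for idx in blocks:
    r = A[:, T].T @ nu - sgn                       # residual of (6.16); uses past blocks only
    nu[idx] -= (1.0 / len(idx)) * A[np.ix_(idx, T)] @ r   # zero-padded stochastic-gradient step
print("residual on T:      ", np.abs(A[:, T].T @ nu - sgn).max())  # shrinks geometrically in L
off = np.setdiff1d(np.arange(p), T)
print("off-support maximum:", np.abs(A[:, off].T @ nu).max())      # < 1 w.h.p. when n >> k log p
```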
Reference
[1] "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," E. Candes, J. Romberg, and T. Tao, IEEE Transactions on Information Theory, 2006.
[2] "Compressed sensing," D. Donoho, IEEE Transactions on Information Theory, 2006.
[3] "Near-optimal signal recovery from random projections: Universal encoding strategies?," E. Candes and T. Tao, IEEE Transactions on Information Theory, 2006.
[4] Lecture notes, Advanced topics in signal processing (ECE 821), Y. Chi, 2015.
[5] "A mathematical introduction to compressive sensing," S. Foucart and H. Rauhut, Springer, 2013.
[6] "On sparse reconstruction from Fourier and Gaussian measurements," M. Rudelson and R. Vershynin, Communications on Pure and Applied Mathematics, 2008.
Reference
[7] "Decoding by linear programming," E. Candes and T. Tao, IEEE Transactions on Information Theory, 2006.
[8] "The restricted isometry property and its implications for compressed sensing," E. Candes, Comptes Rendus de l'Academie des Sciences, 2008.
[9] "Introduction to the non-asymptotic analysis of random matrices," R. Vershynin, Compressed Sensing: Theory and Applications, 2010.
[10] "Iterative hard thresholding for compressed sensing," T. Blumensath and M. Davies, Applied and Computational Harmonic Analysis, 2009.
[11] "Recovering low-rank matrices from few coefficients in any basis," D. Gross, IEEE Transactions on Information Theory, 2011.
[12] "A probabilistic and RIPless theory of compressed sensing," E. Candes and Y. Plan, IEEE Transactions on Information Theory, 2011.
Reference
[13] "Statistical learning with sparsity: the Lasso and generalizations," T. Hastie, R. Tibshirani, and M. Wainwright, 2015.
[14] "Suprema of chaos processes and the restricted isometry property," F. Krahmer, S. Mendelson, and H. Rauhut, Communications on Pure and Applied Mathematics, 2014.