The uniform uncertainty principle and compressed sensing
Harmonic analysis and related topics, Seville, December 5, 2008
Emmanuel Candès (Caltech), Terence Tao (UCLA)
Uncertainty principles

A basic principle in harmonic analysis is:

Uncertainty principle: (informal) If a function $f: G \to \mathbb{C}$ on an abelian group $G$ is concentrated in a small set, then its Fourier transform $\hat f: \hat G \to \mathbb{C}$ must be spread out over a large set.

There are many results that rigorously capture this sort of principle.
For instance, for the real line $G = \mathbb{R}$, with the standard Fourier transform $\hat f(\xi) = \int_{\mathbb{R}} f(x) e^{-2\pi i x \xi}\, dx$, we have

Heisenberg uncertainty principle: If $\|f\|_{L^2(\mathbb{R})} = \|\hat f\|_{L^2(\mathbb{R})} = 1$, and $x_0, \xi_0 \in \mathbb{R}$, then
$$\|(x - x_0) f\|_{L^2(\mathbb{R})} \cdot \|(\xi - \xi_0) \hat f\|_{L^2(\mathbb{R})} \ge \frac{1}{4\pi}.$$
(More succinctly: $(\Delta x)(\Delta \xi) \ge \frac{1}{4\pi}$.)

Proof: Normalise $x_0 = \xi_0 = 0$, use the obvious inequality $\int_{\mathbb{R}} |a x f(x) + i b f'(x)|^2\, dx \ge 0$, integrate by parts, and optimise in $a, b$.
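As a quick numerical sanity check (not in the original slides), one can verify on a grid that the extremising Gaussian $f(x) = 2^{1/4} e^{-\pi x^2}$ attains the bound $1/(4\pi)$ exactly:

```python
import numpy as np

# Numerical check of the Heisenberg uncertainty principle for the
# extremising Gaussian f(x) = 2^{1/4} e^{-pi x^2}, which has unit L^2 norm.
# With the e^{-2 pi i x xi} convention this Gaussian is its own Fourier
# transform, so (Delta x)(Delta xi) should equal exactly 1/(4 pi).
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
f = 2**0.25 * np.exp(-np.pi * x**2)

norm_sq = np.sum(np.abs(f)**2) * dx                   # should be 1
delta_x = np.sqrt(np.sum(x**2 * np.abs(f)**2) * dx)   # spread in space
delta_xi = delta_x                                    # the Gaussian is self-dual
product = delta_x * delta_xi
print(norm_sq, product, 1 / (4 * np.pi))
```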
Equality is attained for centred Gaussians $f(x) = c e^{-\pi A x^2}$, $\hat f(\xi) = \frac{c}{\sqrt{A}} e^{-\pi \xi^2 / A}$ when $x_0 = \xi_0 = 0$; this example can be translated and modulated to produce similar examples for other $x_0, \xi_0$.
What about for finite abelian groups $G$, e.g. cyclic groups $G = \mathbb{Z}/N\mathbb{Z}$? The Pontryagin dual group $\hat G$ of characters $\xi: G \to \mathbb{R}/\mathbb{Z}$ has the same cardinality as $G$. For $f: G \to \mathbb{C}$, we define the Fourier transform $\hat f: \hat G \to \mathbb{C}$ as
$$\hat f(\xi) := \int_G f(x) e(-\xi \cdot x)\, dx$$
where $e(\theta) := e^{2\pi i \theta}$ and $dx = \frac{1}{|G|}\, d\#$ is normalised counting measure on $G$.
We have the inversion formula
$$f(x) = \sum_{\xi \in \hat G} \hat f(\xi) e(\xi \cdot x)$$
and the Plancherel formula $\|f\|_{L^2(G)} = \|\hat f\|_{\ell^2(\hat G)}$.
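These conventions can be checked mechanically: numpy's FFT matches the normalised transform above up to the $1/|G|$ factor (the helper names `ft`/`ift` below are ad hoc):

```python
import numpy as np

# The normalised Fourier transform on G = Z/NZ: fhat(xi) = (1/N) sum_x f(x) e(-x xi / N),
# with inversion f(x) = sum_xi fhat(xi) e(x xi / N).  Then
# ||f||_{L^2(G)}^2 = (1/N) sum_x |f(x)|^2 equals ||fhat||_{l^2}^2 = sum_xi |fhat(xi)|^2.
def ft(f):
    return np.fft.fft(f) / len(f)     # numpy's fft, rescaled by 1/|G|

def ift(fhat):
    return np.fft.ifft(fhat) * len(fhat)

rng = np.random.default_rng(0)
N = 12
f = rng.normal(size=N) + 1j * rng.normal(size=N)
fhat = ft(f)

L2_norm_sq = np.sum(np.abs(f)**2) / N   # normalised counting measure on G
l2_norm_sq = np.sum(np.abs(fhat)**2)    # counting measure on the dual group
print(np.allclose(L2_norm_sq, l2_norm_sq), np.allclose(ift(fhat), f))
```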
The analogues of Gaussians for finite abelian groups are the indicator functions of subgroups. If $H \le G$ is a subgroup of $G$, define the orthogonal complement $H^\perp \subset \hat G$ as
$$H^\perp := \{\xi \in \hat G : \xi \cdot x = 0 \text{ for all } x \in H\}.$$
We have the Poisson summation formula
$$\widehat{1_H} = \frac{|H|}{|G|} 1_{H^\perp}$$
(in particular, the Fourier transform of the constant function $1$ is a Dirac mass, and vice versa). From this and Plancherel we have the basic identity $|H| \cdot |H^\perp| = |G|$.
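A small worked instance (an illustrative example, with $G = \mathbb{Z}/12\mathbb{Z}$ chosen for concreteness):

```python
import numpy as np

# Poisson summation on G = Z/12Z: take H = {0,3,6,9} (multiples of 3).
# Then H^perp = {xi : 3 xi = 0 mod 12} = {0,4,8}, and the normalised
# Fourier transform of 1_H should be (|H|/|G|) 1_{H^perp} = (1/3) on {0,4,8},
# confirming |H| * |H^perp| = 4 * 3 = 12 = |G|.
N = 12
f = np.zeros(N)
f[::3] = 1                      # indicator function of the subgroup H
fhat = np.fft.fft(f) / N        # normalised transform from the slides

support = np.flatnonzero(np.abs(fhat) > 1e-12)
print(support, fhat[support].real)   # expect [0 4 8] with values 1/3
```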
More generally, for finite abelian $G$ we have

Donoho-Stark uncertainty principle (1989): For any non-trivial $f: G \to \mathbb{C}$, we have $|\mathrm{supp}(f)| \cdot |\mathrm{supp}(\hat f)| \ge |G|$.

Proof: Combine Plancherel's theorem with the Hölder inequality estimates
$$\|f\|_{L^1(G)} \le |\mathrm{supp}(f)|^{1/2} |G|^{-1/2} \|f\|_{L^2(G)}; \qquad \|\hat f\|_{\ell^2(\hat G)} \le |\mathrm{supp}(\hat f)|^{1/2} \|\hat f\|_{\ell^\infty(\hat G)}$$
and the Riemann-Lebesgue inequality $\|\hat f\|_{\ell^\infty(\hat G)} \le \|f\|_{L^1(G)}$.
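One can test the bound directly on a few signals (a sketch; the helper name `support_product` and the choice $N = 12$ are mine), with equality for the subgroup indicators:

```python
import numpy as np

# Check the Donoho-Stark bound |supp(f)| * |supp(fhat)| >= |G| on Z/NZ
# for a few test functions; equality holds for indicators of subgroups
# (and their translates/modulates), e.g. the Dirac mass and the constant.
def support_product(f, tol=1e-10):
    N = len(f)
    fhat = np.fft.fft(f) / N
    return np.count_nonzero(np.abs(f) > tol) * np.count_nonzero(np.abs(fhat) > tol)

N = 12
dirac = np.zeros(N); dirac[0] = 1.0       # subgroup {0}
const = np.ones(N)                        # subgroup G itself
comb = np.zeros(N); comb[::4] = 1.0       # indicator of the subgroup {0,4,8}
rng = np.random.default_rng(1)
generic = rng.normal(size=N)              # generic f: both supports full

products = [support_product(f) for f in (dirac, const, comb, generic)]
print(products)                            # each entry is >= N = 12
```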
One can show that equality is attained precisely for the indicators $1_H$ of subgroups, up to translation, modulation, and multiplication by constants.
One also has a slightly more quantitative variant:

Entropy uncertainty principle: If $\|f\|_{L^2(G)} = \|\hat f\|_{\ell^2(\hat G)} = 1$, then
$$\int_G |f(x)|^2 \log \frac{1}{|f(x)|}\, dx + \sum_{\xi \in \hat G} |\hat f(\xi)|^2 \log \frac{1}{|\hat f(\xi)|} \ge 0.$$

Proof: Differentiate (!) the Hausdorff-Young inequality $\|\hat f\|_{\ell^{p'}(\hat G)} \le \|f\|_{L^p(G)}$ (with $1 \le p \le 2$ and $p'$ the dual exponent) at $p = 2$.
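A numerical check of the entropy uncertainty principle (the helper name `entropy_sum` and the examples are mine; subgroup indicators should give equality, generic functions a strictly positive value):

```python
import numpy as np

# Entropy uncertainty principle on Z/NZ: after normalising ||f||_{L^2(G)} = 1,
#   (1/N) sum_x |f|^2 log(1/|f|)  +  sum_xi |fhat|^2 log(1/|fhat|)  >=  0,
# with equality for indicators of subgroups (up to symmetries).
def entropy_sum(f):
    N = len(f)
    f = f / np.sqrt(np.sum(np.abs(f)**2) / N)   # normalise in L^2(G)
    fhat = np.fft.fft(f) / N
    a, b = np.abs(f), np.abs(fhat)
    # 0 * log(1/0) is interpreted as 0, as usual for entropies
    h1 = np.sum(np.where(a > 0, a**2 * np.log(1 / np.maximum(a, 1e-300)), 0.0)) / N
    h2 = np.sum(np.where(b > 0, b**2 * np.log(1 / np.maximum(b, 1e-300)), 0.0))
    return h1 + h2

N = 12
rng = np.random.default_rng(2)
comb = np.zeros(N); comb[::4] = 1.0     # subgroup indicator: equality case
e0 = entropy_sum(comb)
e1 = entropy_sum(rng.normal(size=N))    # generic f: strict inequality
print(e0, e1)
```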
From Jensen's inequality we have
$$\int_G |f(x)|^2 \log \frac{1}{|f(x)|}\, dx \le \frac{1}{2} \log \frac{|\mathrm{supp}(f)|}{|G|}; \qquad \sum_{\xi \in \hat G} |\hat f(\xi)|^2 \log \frac{1}{|\hat f(\xi)|} \le \frac{1}{2} \log |\mathrm{supp}(\hat f)|$$
and so the entropy uncertainty principle implies the Donoho-Stark uncertainty principle. Again, equality is attained for indicators of subgroups (up to translation, modulation, and scalar multiplication).
For arbitrary groups $G$ and arbitrary functions $f$, one cannot hope to do much better than the above uncertainty principles, due to the subgroup examples $f = 1_H$. On the other hand, for generic groups and functions, one expects to do a lot better. (For instance, for generic $f$ one has $\mathrm{supp}(f) = G$ and $\mathrm{supp}(\hat f) = \hat G$.) So one expects to obtain improved estimates by imposing additional hypotheses on $G$ or $f$.
For instance, for cyclic groups $G = \mathbb{Z}/p\mathbb{Z}$ of prime order, which have no non-trivial subgroups, we have

Uncertainty principle for $\mathbb{Z}/p\mathbb{Z}$ (T., 2005): If $f: \mathbb{Z}/p\mathbb{Z} \to \mathbb{C}$ is non-trivial, then $|\mathrm{supp}(f)| + |\mathrm{supp}(\hat f)| \ge p + 1$.

This is equivalent to an old result of Chebotarev that all minors of the Fourier matrix $(e(x\xi/p))_{1 \le x, \xi \le p}$ are non-zero, which is proven by algebraic methods. The result is completely sharp: if $A, B$ are sets with $|A| + |B| \ge p + 1$, then there exists a function $f$ with $\mathrm{supp}(f) = A$ and $\mathrm{supp}(\hat f) = B$. Partial extensions to other groups (Meshulam, 2006).
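Chebotarev's statement can be verified by brute force for a small prime (an illustrative check; $p = 7$ is my choice, and floating-point determinants stand in for the exact algebraic statement):

```python
import numpy as np
from itertools import combinations

# Chebotarev's theorem: for prime p, every square submatrix of the Fourier
# matrix (e(x xi / p))_{0 <= x, xi < p} is non-singular.  Brute-force check
# for p = 7: the smallest |det| over all minors should be strictly positive.
p = 7
F = np.exp(2j * np.pi * np.outer(np.arange(p), np.arange(p)) / p)

min_abs_det = min(
    abs(np.linalg.det(F[np.ix_(rows, cols)]))
    for k in range(1, p + 1)
    for rows in combinations(range(p), k)
    for cols in combinations(range(p), k)
)
print(min_abs_det)   # strictly positive: no singular minor
```

For composite $N$ the analogous check fails: submatrices built from a subgroup and its orthogonal complement are singular.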
This uncertainty principle has some amusing applications to arithmetic combinatorics; for instance, it implies the Cauchy-Davenport inequality (1813)
$$|A + B| \ge \min(|A| + |B| - 1, p)$$
for subsets $A, B$ of $\mathbb{Z}/p\mathbb{Z}$. (Proof: apply the uncertainty principle to functions of the form $f * g$, where $f$ is supported in $A$, $g$ is supported in $B$, and $\mathrm{supp}(\hat f), \mathrm{supp}(\hat g)$ are chosen to have as small an intersection as possible.) Further applications of this type: (Sun-Guo, 2008), (Guo, 2008).
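The Cauchy-Davenport inequality is easy to confirm exhaustively for a small prime (a brute-force sketch over all pairs of nonempty subsets, with $p = 7$ my choice):

```python
from itertools import combinations

# Brute-force check of the Cauchy-Davenport inequality
# |A + B| >= min(|A| + |B| - 1, p) over all nonempty A, B in Z/pZ, p = 7.
p = 7

def subsets(s):
    return [set(c) for k in range(1, len(s) + 1) for c in combinations(s, k)]

ok = all(
    len({(a + b) % p for a in A for b in B}) >= min(len(A) + len(B) - 1, p)
    for A in subsets(range(p))
    for B in subsets(range(p))
)
print(ok)   # True
```

Arithmetic progressions with the same common difference attain equality, which is why the bound cannot be improved.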
The uncertainty principle for $\mathbb{Z}/p\mathbb{Z}$ has the following equivalent interpretation:

Uncertainty principle for $\mathbb{Z}/p\mathbb{Z}$: If $f: \mathbb{Z}/p\mathbb{Z} \to \mathbb{C}$ is $S$-sparse (i.e. $|\mathrm{supp}(f)| \le S$) and non-trivial, and $\Omega \subset \mathbb{Z}/p\mathbb{Z}$ has cardinality at least $S$, then $\hat f$ does not vanish identically on $\Omega$.

From a signal processing perspective, this means that any $S$ Fourier coefficients of $f$ are sufficient to detect the non-triviality of an $S$-sparse signal. This is of course best possible.
It is crucial that $p$ is prime. For instance, if $N$ is a perfect square, then $\mathbb{Z}/N\mathbb{Z}$ contains a subgroup of size $\sqrt{N}$, and the indicator function of that subgroup (the Dirac comb) vanishes on $N - \sqrt{N}$ Fourier coefficients despite being only $\sqrt{N}$-sparse.
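The Dirac comb obstruction in concrete form (with $N = 16$ as an illustrative choice):

```python
import numpy as np

# The Dirac comb for composite N: take N = 16 and the subgroup
# H = {0, 4, 8, 12} of size sqrt(N) = 4.  Its indicator function is
# 4-sparse, yet its Fourier transform vanishes on N - sqrt(N) = 12
# frequencies, so 12 Fourier coefficients cannot detect it.
N = 16
f = np.zeros(N)
f[::4] = 1.0
fhat = np.fft.fft(f) / N

sparsity = np.count_nonzero(f)
zero_freqs = int(np.sum(np.abs(fhat) < 1e-12))
print(sparsity, zero_freqs)   # 4 and 12
```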
As a corollary of the uncertainty principle, if $f: \mathbb{Z}/p\mathbb{Z} \to \mathbb{C}$ is an unknown signal which is known to be $S$-sparse, and we measure $2S$ Fourier coefficients $(\hat f(\xi))_{\xi \in \Omega}$ of $f$, then this uniquely determines $f$; for if two $S$-sparse signals $f, g$ had the same Fourier coefficients on $\Omega$, then the $2S$-sparse difference $f - g$ would have trivial Fourier transform on $\Omega$, a contradiction.
This is a prototype of a compressed sensing result. Compressed sensing refers to the ability to reconstruct sparse (or compressed) signals using very few measurements, without knowing in advance the support of the signal. (Note that one normally needs all $p$ Fourier coefficients in order to recover a general signal; the point is that sparse signals have a much lower information entropy and thus are easier to recover than general signals.)
However, this result is unsatisfactory for several reasons:

- It is ineffective. It says that recovery of the $S$-sparse signal $f$ from $2S$ Fourier coefficients is possible (since $f$ is uniquely determined), but gives no efficient algorithm to actually locate this $f$.
- It is not robust. For instance, the result fails if $p$ is changed from a prime to a composite number. One can also show that the result is not stable with respect to small perturbations of $f$, even if one keeps $p$ prime.
It turns out that both of these problems can be solved if the frequency set $\Omega$ does more than merely detect the presence of a non-trivial sparse signal, but gives an accurate measurement as to how large that signal is. This motivates:

Restricted Isometry Principle (RIP): A set of frequencies $\Omega \subset \mathbb{Z}/N\mathbb{Z}$ is said to obey the RIP with sparsity $S$ and error tolerance $\delta$ if one has
$$(1 - \delta) \frac{|\Omega|}{N} \|f\|_{L^2(\mathbb{Z}/N\mathbb{Z})}^2 \le \|\hat f\|_{\ell^2(\Omega)}^2 \le (1 + \delta) \frac{|\Omega|}{N} \|f\|_{L^2(\mathbb{Z}/N\mathbb{Z})}^2$$
for all $S$-sparse functions $f$.
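The RIP can be probed empirically: for $f$ supported on $T$, the inequality above says that the singular values squared of the rescaled Fourier submatrix with rows in $\Omega$ and columns in $T$ lie in $[1 - \delta, 1 + \delta]$. A sketch that estimates $\delta$ over random supports (all parameters below are illustrative choices, and sampling supports only gives a lower bound on the true $\delta$):

```python
import numpy as np

# Empirical look at the RIP: restricting the normalised transform of an
# S-sparse f (supported on T) to Omega is the linear map given by the
# submatrix (e(-x xi / N))_{xi in Omega, x in T}; after dividing by
# sqrt(|Omega|), the RIP asks its squared singular values to lie in
# [1 - delta, 1 + delta] for every such T.  We sample random supports.
rng = np.random.default_rng(3)
N, S = 128, 5
Omega = np.sort(rng.choice(N, size=96, replace=False))
E = np.exp(-2j * np.pi * np.outer(Omega, np.arange(N)) / N)

delta = 0.0
for _ in range(200):
    T = rng.choice(N, size=S, replace=False)    # a random S-sparse support
    s = np.linalg.svd(E[:, T] / np.sqrt(len(Omega)), compute_uv=False)
    delta = max(delta, abs(s[0]**2 - 1), abs(s[-1]**2 - 1))
print(delta)   # worst deviation observed; small delta = near-isometry
```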
Note that the factor $\frac{|\Omega|}{N}$ is natural in view of Plancherel's theorem; the RIP asserts that $\Omega$ always captures its fair share of the energy of a sparse function. It implies that $\Omega$ detects the presence of non-trivial $S$-sparse functions, but is much stronger than this.
This principle is very useful in compressed sensing, e.g.

Theorem (Candès-Romberg-T., 2005): Suppose $\Omega \subset \mathbb{Z}/N\mathbb{Z}$ obeys the RIP with sparsity $4S$ and error tolerance $1/4$. Then any $S$-sparse signal $f$ is the unique solution $g: \mathbb{Z}/N\mathbb{Z} \to \mathbb{C}$ to the problem $\hat g|_\Omega = \hat f|_\Omega$ with minimal $L^1(G)$ norm. In particular, $f$ can be reconstructed from the Fourier measurements $\hat f|_\Omega$ by solving a convex optimisation problem.
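The $L^1$ minimisation can be carried out as a linear program. Below is a minimal real-valued sketch using a random Gaussian measurement matrix in place of the partial Fourier measurements of the theorem (an illustrative substitution; the dimensions and the split $g = u - v$ are standard basis-pursuit devices, not from the slides):

```python
import numpy as np
from scipy.optimize import linprog

# Basis-pursuit sketch: recover an S-sparse real signal from m < n linear
# measurements by solving  min ||g||_1  subject to  A g = A f.
# Split g = u - v with u, v >= 0; then ||g||_1 = sum(u) + sum(v) at the
# optimum, and the problem becomes a linear program.
rng = np.random.default_rng(4)
n, m, S = 40, 24, 3
A = rng.normal(size=(m, n))                 # stand-in for partial Fourier rows
f = np.zeros(n)
f[rng.choice(n, size=S, replace=False)] = rng.normal(size=S)

c = np.ones(2 * n)                          # minimise sum(u) + sum(v)
A_eq = np.hstack([A, -A])                   # encodes A (u - v) = A f
res = linprog(c, A_eq=A_eq, b_eq=A @ f, bounds=[(0, None)] * (2 * n))
g = res.x[:n] - res.x[n:]
print(res.status, np.max(np.abs(g - f)))    # exact recovery is typical here
```

Note that $f$ itself is feasible, so the minimiser's $L^1$ norm never exceeds $\|f\|_1$; the content of the theorem is that under the RIP the minimiser is exactly $f$.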
Sketch of proof: If a function $g = f + h$ is distinct from $f$ but has the same Fourier coefficients as $f$ on $\Omega$, use the RIP to show that $h$ has a substantial presence outside of the support of $f$ compared to its presence inside this support, and use this to show that $f + h$ must have a strictly larger $L^1(G)$ norm than $f$.
Similar arguments show that signal recovery using frequency sets that obey the RIP is robust with respect to noise or lack of perfect sparsity (e.g. if $f$ is merely $S$-compressible rather than $S$-sparse, i.e. small outside of a set of size $S$). There is now a vast literature on how to efficiently perform compressed sensing for various measurement models, many of which obey (or are assumed to obey) the RIP.
On the other hand, the RIP fails for many frequency sets. Consider for instance an $S$-sparse function $f: \mathbb{Z}/N\mathbb{Z} \to \mathbb{C}$ that is a bump function adapted to the interval $\{-S/2, \ldots, S/2\}$ in $\mathbb{Z}/N\mathbb{Z}$. Then $\hat f$ is concentrated in an interval of length about $N/S$ centred at the frequency origin. If $\Omega$ avoids this interval (or intersects it with too high or too low a density), then the RIP fails. Variants of this example show that a frequency set must be equidistributed in various senses if it is to obey the RIP.
Uniform uncertainty principle (Candès-T., 2006): A randomly chosen subset $\Omega$ of $\mathbb{Z}/N\mathbb{Z}$ of size $C S \log^6 N$ will obey the RIP with high probability $(1 - O(N^{-C}))$.

Informally, a randomly chosen set of size $O(S \log^6 N)$ will always capture its fair share of the energy of any $S$-sparse function; thus we have a sort of local Plancherel theorem for sparse functions that only requires a random subset of the frequencies.
This implies that robust compressed sensing is possible with an oversampling factor of $O(\log^6 N)$. This was improved to $O(\log^5 N)$ (Rudelson-Vershynin, 2008). In practice, numerics show that an oversampling factor of 4 or 5 is sufficient. A separate argument (Candès-Romberg-Tao, 2006) shows that (non-robust) compressed sensing is possible w.h.p. with an oversampling factor of just $O(\log N)$.
The method of proof is related to Bourgain's solution of the $\Lambda_p$ problem, which eventually reduced to understanding the behaviour of maximal exponential sums such as
$$\Lambda_p(\Omega) := \sup\Big\{ \Big\| \sum_{\xi \in \Omega} c_\xi e(x\xi/N) \Big\|_{L^p(\mathbb{Z}/N\mathbb{Z})} : \|c\|_{\ell^2(\Omega)} = 1 \Big\}$$
for randomly chosen sets $\Omega$. In particular it relies on a chaining argument used by Bourgain (and also, simultaneously, by Talagrand).
The chaining argument

For each individual $S$-sparse function $f$, and a random $\Omega$, it is not hard to show that the desired inequality
$$(1 - \delta) \frac{|\Omega|}{N} \|f\|_{L^2(\mathbb{Z}/N\mathbb{Z})}^2 \le \|\hat f\|_{\ell^2(\Omega)}^2 \le (1 + \delta) \frac{|\Omega|}{N} \|f\|_{L^2(\mathbb{Z}/N\mathbb{Z})}^2$$
holds with high probability; this is basically the law of large numbers (and is the reason why Monte Carlo integration works), and one can get very good estimates using the Chernoff inequality. The problem is that there are a lot of $S$-sparse functions in the world, and the total probability of error quickly adds up.
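A Monte Carlo illustration of this single-function concentration (parameters are illustrative choices, not from the slides): for one fixed sparse $f$, the energy a random $\Omega$ captures clusters tightly around its fair share.

```python
import numpy as np

# For a single fixed S-sparse f, sample many random frequency sets Omega of
# size m and compare the captured energy ||fhat|_Omega||^2 with its fair
# share (m/N) ||f||_{L^2(G)}^2.  The ratio concentrates around 1, which is
# the per-function estimate that the chaining argument then uniformises.
rng = np.random.default_rng(5)
N, S, m = 256, 8, 64
f = np.zeros(N)
f[rng.choice(N, size=S, replace=False)] = rng.normal(size=S)
fhat = np.fft.fft(f) / N
total = np.sum(np.abs(fhat)**2)          # = ||f||_{L^2(G)}^2 by Plancherel

ratios = []
for _ in range(2000):
    Omega = rng.choice(N, size=m, replace=False)
    ratios.append(np.sum(np.abs(fhat[Omega])**2) / ((m / N) * total))
ratios = np.array(ratios)
print(ratios.mean(), ratios.std())       # mean near 1, small spread
```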
One can partially resolve this problem by discretisation: pick an $\varepsilon$, and cover the space $\Sigma$ of all $S$-sparse functions in some suitable metric (e.g. the $L^2$ metric) by an $\varepsilon$-net of functions. But it turns out that there are still too many functions in the net to control, and even after controlling these functions, the $S$-sparse functions in $\Sigma$ that are near to the net, but not actually on the net, are still not easy to handle.
The solution is to chain several nets together, or more precisely to chain together $2^{-n}$-nets $N_n$ of $\Sigma$ for each $n = 1, 2, 3, \ldots$. Instead of controlling the functions $f_n$ in each net $N_n$ separately, one instead controls the extent to which $f_n$ deviates from its parent $f_{n-1}$ in the next coarser net $N_{n-1}$, defined as the nearest element of $N_{n-1}$ to $f_n$. This deviation is much smaller than $f_n$ itself, in practice, and is easier to control. After getting good control on all of these deviations, one can then control an arbitrary function $f$ by expressing $f$ as a telescoping series of the differences $f_n - f_{n-1}$.