Sparse Analysis, Lecture V: From Sparse Approximation to Sparse Signal Recovery
Anna C. Gilbert, Department of Mathematics, University of Michigan
Connection between... Sparse Approximation and Compressed Sensing
Encoding schemes
Compressed Sensing: image → linear encoding with Φ (matrix is image independent) → nonlinear decoding → approx. image
Sparse Approximation: image → nonlinear encoding with Ω (matrix is image dependent) → linear decoding → approx. image
Linear encoding, nonlinear decoding
Φx = c → highly nonlinear decoding → x̂;  Φy = d → highly nonlinear decoding → ŷ
Measure the accuracy of the decoded signal with respect to x_k, the best k-term approximation to x (in some orthonormal basis).
Problem statement
Design a matrix Φ: ℝ^n → ℝ^m with m as small as possible so that, given Φx = y for any signal x ∈ ℝ^n, there is an algorithm to recover x̂ with ‖x − x̂‖_p ≤ C ‖x − x_k‖_q.
Parameters
Number of measurements m
Recovery time
Approximation guarantee (p, q norms, possibly mixed)
One matrix vs. a distribution over matrices
Explicit construction
Tolerance to measurement noise
Comparison with Sparse Approximation
Sparse: Given y and Φ, find (sparse) x such that y = Φx. Return x̂ with the guarantee that ‖Φx̂ − y‖_2 is small compared with ‖y − Φx_k‖_2.
CS: Given y and Φ, find (sparse) x such that y = Φx. Return x̂ with the guarantee that ‖x − x̂‖_2 is small compared with (1/√k) ‖x − x_k‖_1.
Analogy: root-finding
Find p̂ with |p̂ − p| ≤ ε versus find p̂ with |f(p̂) − 0| ≤ ε, where p is the root.
Sparse: Given f (and y = 0), find p such that f(p) = 0. Return p̂ with the guarantee that |f(p̂) − 0| is small.
CS: Given f (and y = 0), find p such that f(p) = 0. Return p̂ with the guarantee that |p̂ − p| is small.
Root-finding analogy?
Φx = x_1 + (1/2) x_2
[Figure: contour plot of Φx = x_1 + (1/2) x_2 over the (x_1, x_2) plane]
Algorithms for CS
Convex optimization
Greedy iterative methods
Problem formulations: recover the entire signal, or recover the k significant terms
Role of probability:
Probabilistic method: if Φ is chosen from a suitable distribution, then whp it satisfies certain properties (illustrated in the code below)
Use of randomness in the algorithm itself
Sparse representation is central to successful algorithms
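As a small illustration of the probabilistic method at work (a sketch with assumed toy dimensions n, m, k and the normalization 1/√m, not taken from the lecture), the snippet below draws Φ with iid Gaussian entries and checks empirically that it nearly preserves the norms of random k-sparse vectors. This is only a sanity check on random test vectors, not a verification of RIP, which quantifies over all sparse vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 1024, 128, 10                     # assumed toy dimensions

# Draw Phi with iid N(0, 1/m) entries so that E[||Phi x||_2^2] = ||x||_2^2.
Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))

# Empirical check of near-isometry on random k-sparse test vectors.
ratios = []
for _ in range(1000):
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.normal(size=k)
    ratios.append(np.linalg.norm(Phi @ x) / np.linalg.norm(x))

print(f"||Phi x|| / ||x|| over 1000 trials: [{min(ratios):.3f}, {max(ratios):.3f}]")
```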
CS: Geometric methods
Suppose Φ satisfies the RIP(2, 2k, δ) condition: for any 2k-sparse vector x,
(1 − δ) ‖x‖_2 ≤ ‖Φx‖_2 ≤ (1 + δ) ‖x‖_2.
Given Φx = y, the solution x̂ to the convex relaxation problem
x̂ = argmin ‖z‖_1  s.t.  Φz = Φx
satisfies ‖x − x̂‖_2 ≤ (C/√k) ‖x − x_k‖_1.
Dense recovery matrices: if Φ is drawn as a random matrix with iid (sub-)Gaussian entries, or as random rows of the Fourier matrix, with m = O(k log(n/k)) rows and n columns, then Φ satisfies RIP(2) with high probability.
An example of the probabilistic method for generating a matrix; not constructive.
Uniform guarantee: one matrix Φ that works for all x.
[Donoho 04], [Candes-Tao 04, 06], [Candes-Romberg-Tao 05], [Rudelson-Vershynin 06], [Cohen-Dahmen-DeVore 06], and many others...
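As a minimal numerical illustration of this ℓ1 relaxation (not from the lecture; the dimensions and the solver choice are assumptions), one can recast min ‖z‖_1 s.t. Φz = y as a linear program with z = u − v, u, v ≥ 0, and hand it to scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m, k = 200, 60, 5                       # assumed toy dimensions

Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
y = Phi @ x                                # measurements

# Basis pursuit: min ||z||_1 s.t. Phi z = y, written as an LP with z = u - v, u, v >= 0.
c = np.ones(2 * n)                         # objective: sum(u) + sum(v) = ||z||_1
A_eq = np.hstack([Phi, -Phi])              # Phi u - Phi v = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]

print("recovery error:", np.linalg.norm(x - x_hat))
```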
CS: Greedy algorithms
Suppose Φ satisfies the RIP(2, 2k, δ) condition. Given Φx = y, there are greedy iterative algorithms that produce x̂ with ‖x̂‖_0 = k and
‖x − x̂‖_2 ≤ C ( ‖x − x_k‖_2 + (1/√k) ‖x − x_k‖_1 ).
[Tropp-Needell 07], [Blumensath-Davies 08], and others
The architecture of these algorithms is the greedy pursuit (OMP):
Maximize: choose λ = a set of columns of Φ with large dot products Φ^T r
Update: Λ = Λ ∪ λ; a = best linear combination over Φ_Λ
Iterate: r = y − Φa
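For concreteness, here is a bare-bones orthogonal matching pursuit following the Maximize/Update/Iterate loop above (an illustrative sketch, not the lecture's implementation; it selects one column per iteration and re-fits by least squares over the current support):

```python
import numpy as np

def omp(Phi, y, k):
    """Greedy pursuit: pick the column most correlated with the residual,
    re-fit by least squares over the chosen columns, and repeat k times."""
    m, n = Phi.shape
    support = []
    r = y.copy()
    for _ in range(k):
        # Maximize: column with the largest dot product against the residual.
        lam = int(np.argmax(np.abs(Phi.T @ r)))
        if lam not in support:
            support.append(lam)
        # Update: best coefficients over the current support (least squares).
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        # Iterate: new residual.
        r = y - Phi[:, support] @ coeffs
    x_hat = np.zeros(n)
    x_hat[support] = coeffs
    return x_hat
```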
Computational costs
Computational time: all dominated by Φ^T r, a matrix-vector product. LP: O(Tn); greedy: O(n k log(n/k) log ‖x‖_2).
Storage: have to store Φ, a k log(n/k) × n matrix (unless it has special structure!). Store x̂ as you build it (only the non-zero entries, or the entire vector).
Randomness: have to generate Φ. Theoretically, all entries are iid, truly random. Note: drand48 is pseudo-random.
Connection between... Sparse Approximation/Compressive Sensing and Streaming/Sublinear Algorithms
Data streams
IP packets: header (source/destination address, payload size, protocol, etc.), payload (transmitted data)
Backbone link bandwidth = 40 Gbits/sec
Over 90% of packets are no more than 500 bytes
Monitor 10 million packets/second
Heavy hitters
Stream of (IP address, # bytes) pairs: (129.132.69.131, 128), (192.16.1.201, 1024), (172.16.254.1, 64)
[Figure: bar chart of total # Bytes per IP address]
Linear measurements
As each packet (129.132.69.131, 128), (192.16.1.201, 1024), (172.16.254.1, 64) arrives, the sketch is updated: the measurements y = Φx are linear in the count vector x, so each update adds a multiple of one column of Φ (see the example below).
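To make the linearity concrete, here is a toy example (illustrative only: the mapping of IP addresses to coordinates via Python's hash and the random ±1 matrix are assumptions, not the lecture's construction). Each packet (address, # bytes) adds a multiple of one column of Φ to the running measurement vector, so the measurements of the whole stream equal Φ applied to the final count vector.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2 ** 16, 64                           # assumed: 2^16 possible keys, 64 measurements
Phi = rng.choice([-1.0, 1.0], size=(m, n))   # random sign matrix (illustrative)

def key(ip):
    """Map an IP address string to a coordinate in {0, ..., n-1} (assumed hashing)."""
    return hash(ip) % n

stream = [("129.132.69.131", 128), ("192.16.1.201", 1024), ("172.16.254.1", 64)]

y = np.zeros(m)
x = np.zeros(n)                   # kept only to check linearity; a real stream never stores it
for ip, nbytes in stream:
    i = key(ip)
    y += nbytes * Phi[:, i]       # linear update: add a multiple of one column of Phi
    x[i] += nbytes

assert np.allclose(y, Phi @ x)    # sketch of the stream = Phi times the final count vector
```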
Streaming algorithms
Resources: (1) per-item processing time, (2) storage space / auxiliary memory, (3) time to produce the output, (4) what's stored in memory should be composable
Resource constraints for (1), (2), and (3) should be poly log(d)
Streaming/Sub-linear algorithms
There are sub-linear algorithms for CS:
running time: O(k polylog n)
measurements/storage: O(k log(n/k))
error guarantees: matching, or ℓ1/ℓ1, or ℓ2/ℓ2 (with different probabilistic constructions)
quite different geometric restrictions on Φ
use and exploit pseudo-randomness to reduce storage space and speed up the algorithms
all of these conditions are sufficient; none seem to be necessary
[Gilbert-Guha-Indyk-Kotidis-Muthukrishnan-Strauss 02], [Charikar-Chen-Farach-Colton 02], [Cormode-Muthukrishnan 04], [Gilbert-Strauss-Tropp-Vershynin 06, 07], [Gilbert-Li-Porat-Strauss 10]
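One of the simplest sub-linear sketches in this family is the Count-Min sketch of [Cormode-Muthukrishnan 04]. Below is a minimal Python version (the table sizes and hash functions are illustrative assumptions): it maintains a small linear table under streaming updates, is composable across streams, and answers point queries with a value that never underestimates and overestimates only through hash collisions.

```python
import numpy as np

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=0):
        # depth independent hash rows, each of length width: O(width * depth) space.
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width))
        self.seeds = np.random.default_rng(seed).integers(0, 2**31, size=depth)

    def _buckets(self, key):
        # One bucket per row (illustrative hashing; not from the lecture).
        return [hash((int(s), key)) % self.width for s in self.seeds]

    def update(self, key, count):
        for row, b in enumerate(self._buckets(key)):
            self.table[row, b] += count      # linear update, composable across streams

    def query(self, key):
        # Minimum over rows: never underestimates, overestimates only via collisions.
        return min(self.table[row, b] for row, b in enumerate(self._buckets(key)))

cms = CountMinSketch()
for ip, nbytes in [("129.132.69.131", 128), ("192.16.1.201", 1024), ("172.16.254.1", 64)]:
    cms.update(ip, nbytes)
print(cms.query("192.16.1.201"))   # ~1024 (exact unless hash collisions occur)
```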
New algorithms, phase transitions, random models
[Figure: phase diagram of sparsity k versus coherence μ, marking the regimes for 1d histograms / V-OPT / (2,1) bi-criteria guarantees (coherence near 1 − 1/d, sparsity up to d/2 or d/log d), for random subdictionaries with L1 optimization, and for the k ≤ (1/3) μ⁻¹ sufficient conditions for OMP and L1 optimization]
Alternative problem formulations + algorithms
Dictionary Φ = piecewise constants (in 1 dimension)
Extremely high coherence: μ ≈ 1 − 1/d
Signals are linear combinations of piecewise constants, but it is more natural to count buckets in histograms
[Figure: a signal built from piecewise constants with breakpoints L1, L2, R1, R2 and its histogram buckets L1-R1, L2-R2, L3-R3]
Use at most twice as many buckets as piecewise constants in the sparse representation
V-OPT
Theorem. There is a dynamic programming algorithm which produces the k-bucket histogram H_k that minimizes ‖x − H_k‖_2. The algorithm runs in time O(k d²). [Jagadish, et al. 98]
Dynamic programming is a different algorithmic technique from both greedy iterative algorithms and convex optimization.
V-OPT
Idea: within a bucket, the mean of the signal values is the best approximation.
Assume the last bucket is on [j + 1, d]. What can we say about the remaining k − 1 buckets? They must be optimal for the range [1, j] with k − 1 buckets. Hence
opt[d, k] = min_{1 ≤ j < d} { opt[j, k − 1] + cost[(j + 1), d] },
where opt[j, k] = minimum cost of representing the set of values on [1, j] by a histogram with k buckets, and cost[(j + 1), d] = ℓ2 error of a single bucket on [j + 1, d].
V-OPT [Jagadish, et al. 98]
Input: signal x, number of buckets k
Output: k-bucket histogram H_k ≈ x
for i = 1 to d
  for j = 1 to k
    for l = 1 to i − 1   (split point between the (j − 1)-bucket histogram and the last bucket)
      OPT[i, j] = min( OPT[i, j], OPT[l, j − 1] + cost[l + 1, i] )
The answer is read off from OPT[d, k].
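Below is a direct, runnable transcription of this dynamic program in Python (a sketch under the assumption that cost[l+1, i] is the squared ℓ2 error of approximating x[l+1..i] by its mean, computed in O(1) from prefix sums; the variable names are illustrative):

```python
import numpy as np

def vopt_histogram(x, k):
    """O(k d^2) dynamic program for the best k-bucket histogram under squared l2 error."""
    d = len(x)
    # Prefix sums of x and x^2, so the cost of any single bucket is O(1) to evaluate.
    s = np.concatenate([[0.0], np.cumsum(x)])
    s2 = np.concatenate([[0.0], np.cumsum(np.asarray(x, dtype=float) ** 2)])

    def cost(a, b):   # squared error of one bucket on positions a..b (1-indexed, inclusive)
        n = b - a + 1
        return s2[b] - s2[a - 1] - (s[b] - s[a - 1]) ** 2 / n

    OPT = np.full((d + 1, k + 1), np.inf)
    split = np.zeros((d + 1, k + 1), dtype=int)
    OPT[0, 0] = 0.0
    for i in range(1, d + 1):
        OPT[i, 1] = cost(1, i)                   # one bucket covering [1, i]
    for j in range(2, k + 1):
        for i in range(1, d + 1):
            for l in range(j - 1, i):            # split: first l values use j-1 buckets
                c = OPT[l, j - 1] + cost(l + 1, i)
                if c < OPT[i, j]:
                    OPT[i, j], split[i, j] = c, l
    # Recover the histogram by tracing the split points backwards.
    h = np.empty(d)
    i, j = d, k
    while j > 0:
        l = split[i, j] if j > 1 else 0
        h[l:i] = np.mean(x[l:i])                 # bucket value = mean over the bucket
        i, j = l, j - 1
    return h, OPT[d, k]
```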
Images vs. signals
There is a big difference between dimension 1 and any higher dimension for histograms.
Finding the optimal k-rectangle histogram in 2d is NP-hard.
There is an efficient algorithm which achieves the minimal error (for k rectangles) using at most 4k rectangles. [Muthukrishnan and Strauss, 2003]
Summary
Sparse approximation has spawned many research directions, activities, and applications across many fields.
New algorithms, new applications.
Many new veins to be mined in the next lectures.