Lecture 16: Compressed Sensing

Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)
Review of Johnson-Lindenstrauss

- Unsupervised learning technique.
- Key insight: for any $m$ points $\{x_i\}$ in $\mathbb{R}^d$ and any $\varepsilon \in (0, 3)$, there is a $W \in \mathbb{R}^{k \times d}$ with $k = O(\log(m)/\varepsilon^2)$ such that
  $1 - \varepsilon \le \frac{\|W x_i\|_2}{\|x_i\|_2} \le 1 + \varepsilon$ for all $1 \le i \le m$.  (*)
- Further, if we draw $W_{ij} \sim N(0, 1/k)$ independently, then (*) holds with high probability for any $m$-point set!
- J-L is a data-oblivious technique (unlike PCA...).
- Application: nearest neighbors
  - $(1 + O(\varepsilon))$-approximate nearest neighbors: at a query point $x$, if the true NN is at distance $r$, the approximate NN is at distance at most $(1 + O(\varepsilon))\, r$.
  - Speedup from $O(md)$ to $O(mk)$.
- Application: SVM
  - If the data is separable with margin $\gamma$, a random projection to dimension $O(1/\gamma^2)$ yields, with high probability, data separable with margin $\gamma/2$.
  - The effective dimension is $O(1/\gamma^2)$, independent of $d$.
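The projection step is easy to see in code. A minimal numpy sketch (not from the lecture; the constant in $k$, the sample sizes, and the seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, eps = 50, 1000, 0.5
k = int(np.ceil(4 * np.log(m) / eps**2))  # k = O(log(m)/eps^2); the constant 4 is a guess

X = rng.normal(size=(m, d))                          # m arbitrary points in R^d
W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d))  # W_ij ~ N(0, 1/k)

# Each ratio ||W x_i||_2 / ||x_i||_2 should lie in [1 - eps, 1 + eps] w.h.p.
ratios = np.linalg.norm(X @ W.T, axis=1) / np.linalg.norm(X, axis=1)
print(ratios.min(), ratios.max())
```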
This lecture: sparsity

- For $x \in \mathbb{R}^d$, define the pseudo-norm $\|x\|_0 := |\{i \in [d] : x_i \neq 0\}|$.
- $\|x\|_0$ is the number of non-zero components of $x$.
- If $\|x\|_0 \le s$, we can compress $x$ using $s$ (index, value) pairs (sketched below).
- The compression is lossless: exact reconstruction is possible.
- Consider cases where measuring/transmitting/storing $x$ is costly:
  - field sensors
  - MRI
  - wireless ultrasound
- We measure the whole $x \in \mathbb{R}^d$, but only need $2s$ numbers!
- Can we use few measurements to find out $x$? We don't know the locations of the non-zero coordinates!
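As a concrete illustration (not from the lecture), a minimal numpy sketch of the lossless (index, value) encoding; the function names are mine:

```python
import numpy as np

def l0(x):
    """The sparsity pseudo-norm ||x||_0: the number of non-zero components."""
    return np.count_nonzero(x)

def compress_sparse(x):
    """Lossless compression of an s-sparse vector into s (index, value) pairs."""
    idx = np.flatnonzero(x)
    return list(zip(idx.tolist(), x[idx].tolist()))

def decompress_sparse(pairs, d):
    """Exact reconstruction from the (index, value) pairs."""
    x = np.zeros(d)
    for i, v in pairs:
        x[i] = v
    return x

x = np.array([0.0, 3.5, 0.0, 0.0, -1.2, 0.0])
pairs = compress_sparse(x)               # [(1, 3.5), (4, -1.2)]
assert np.array_equal(decompress_sparse(pairs, len(x)), x)
print(l0(x), pairs)
```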
Compressed Sensing

- Compressed sensing: recover the sparse signal $x$ while making only $O(\|x\|_0 \log(d))$ measurements.
- Measurement: a linear map $x \mapsto \mathbb{R}$. Equivalently, a measurement is $\langle u, x \rangle$ for some $u \in \mathbb{R}^d$.
- Linear measurement is easy in many physical devices.
- In return, reconstruction will be more expensive.
- Useful when weak sensing devices transmit to powerful decoders, or when measurements are expensive.
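In matrix form, $k$ such measurements $\langle u_i, x \rangle$ stack into $y = Wx$, with the $u_i$ as the rows of $W$. A tiny numpy illustration (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 100, 10
x = np.zeros(d)
x[[3, 42, 77]] = [1.0, -2.0, 0.5]    # a 3-sparse signal

# Each measurement is <u_i, x>; stacking the u_i as the rows of W
# gives all k measurements at once as y = Wx.
W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d))
y = W @ x
assert np.allclose(y, [W[i] @ x for i in range(k)])
```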
Restricted Isometry Property (RIP)

- We will define a property of matrices called RIP.
- Any matrix $W \in \mathbb{R}^{k \times d}$ compresses $x$ to the $k$-dimensional $Wx$.
- If $W$ has the RIP, then $x$ can be recovered from $Wx$.
- The recovery can be done efficiently.
- We will show: a random $k \times d$ matrix, with i.i.d. Gaussian entries and $k = O(\|x\|_0 \log(d))$, has the RIP with high probability.
Restricted Isometry Property (RIP): definition

Definition: A matrix $W \in \mathbb{R}^{k \times d}$ is $(\varepsilon, s)$-RIP if for all $x \in \mathbb{R}^d$ with $x \neq 0$ and $\|x\|_0 \le s$,
  $1 - \varepsilon \le \frac{\|Wx\|_2}{\|x\|_2} \le 1 + \varepsilon$.

- Looks familiar? Like J-L, $W$ approximately preserves the norm of $x$.
- Unlike J-L:
  - RIP holds for all $s$-sparse vectors.
  - J-L holds for a finite set of $\exp(k\varepsilon^2)$ vectors.
  - J-L does not require $s$-sparsity.
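The RIP quantifies over all $s$-sparse $x$, so code cannot certify it, but a Monte Carlo spot-check can falsify it. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
d, s, eps = 200, 5, 0.5
k = 60   # illustrative; the theory needs k = O(s log(d)/eps^2)
W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d))

# Spot-check the RIP inequality on random s-sparse directions. This cannot
# certify RIP (that quantifies over *all* s-sparse x); it can only falsify it.
ratios = []
for _ in range(10_000):
    x = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    x[support] = rng.normal(size=s)
    ratios.append(np.linalg.norm(W @ x) / np.linalg.norm(x))
print(min(ratios), max(ratios))   # should lie within [1 - eps, 1 + eps]
```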
RIP and lossless compression

Reminder: $W \in \mathbb{R}^{k \times d}$ is $(\varepsilon, s)$-RIP if for all $x \neq 0$ with $\|x\|_0 \le s$, $1 - \varepsilon \le \|Wx\|_2 / \|x\|_2 \le 1 + \varepsilon$.

Theorem: Let $\varepsilon \in (0, 1)$ and let $W$ be an $(\varepsilon, 2s)$-RIP matrix. If $x \in \mathbb{R}^d$ is $s$-sparse and $y = Wx$, then
  $x = \operatorname{argmin}_{v \in \mathbb{R}^d : Wv = y} \|v\|_0$.

Proof:
- Suppose (for contradiction) that some $s$-sparse $\tilde{x} \neq x$ satisfies $y = W\tilde{x}$.
- Then $\|x - \tilde{x}\|_0 \le 2s$.
- Apply RIP to $x - \tilde{x}$: $\|W(x - \tilde{x})\|_2 / \|x - \tilde{x}\|_2 \ge 1 - \varepsilon$.
- But $\|x - \tilde{x}\|_2 > 0$ and $\|W(x - \tilde{x})\|_2 = \|y - y\|_2 = 0$, so $1 - \varepsilon \le 0$: contradiction.
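The theorem suggests recovering $x$ by searching for the sparsest consistent vector. A brute-force numpy sketch for tiny $d$ (the support search is exponential, foreshadowing the efficiency problem on the next slide; the function name l0_recover is mine):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
d, s, k = 12, 2, 8
x = np.zeros(d)
x[1], x[7] = 2.0, -1.0                               # an s-sparse signal
W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d))
y = W @ x

def l0_recover(W, y, tol=1e-9):
    """Brute-force argmin ||v||_0 s.t. Wv = y: try supports in order of size."""
    k, d = W.shape
    if np.linalg.norm(y) <= tol:          # the zero vector is the sparsest solution
        return np.zeros(d)
    for size in range(1, d + 1):
        for support in combinations(range(d), size):
            S = list(support)
            # Least-squares solution restricted to the candidate support S.
            v_S, *_ = np.linalg.lstsq(W[:, S], y, rcond=None)
            if np.linalg.norm(W[:, S] @ v_S - y) <= tol:
                v = np.zeros(d)
                v[S] = v_S
                return v

x_hat = l0_recover(W, y)
print(np.max(np.abs(x_hat - x)))          # ~0: exact recovery
```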
RIP and efficient reconstruction

- By the previous theorem, $x = \operatorname{argmin}_{v \in \mathbb{R}^d : Wv = y} \|v\|_0$.
- So to recover $x$, choose the sparsest element of $\{v : Wv = y\}$.
- This recovery procedure is not efficient (why?). Let's do something else.

Theorem: Let $\varepsilon \in (0, 1)$ and let $W$ be an $(\varepsilon, 2s)$-RIP matrix. If $\varepsilon < \frac{1}{1 + \sqrt{2}}$, then
  $x = \operatorname{argmin}_{v \in \mathbb{R}^d : Wv = y} \|v\|_0 = \operatorname{argmin}_{v \in \mathbb{R}^d : Wv = y} \|v\|_1$.

- Why is this good news? Convex optimization! (See the sketch below.)
- Again, $\ell_1$ regularization encourages sparsity, as in LASSO ($\ell_1$-regularized regression).
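The $\ell_1$ minimization (basis pursuit) can be solved as a linear program via the standard split $v = p - q$. A sketch using scipy's linprog (the function name l1_recover is mine):

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(W, y):
    """Basis pursuit: argmin ||v||_1 s.t. Wv = y, solved as a linear program.

    Write v = p - q with p, q >= 0; at the optimum ||v||_1 = sum(p + q),
    and the constraint Wv = y becomes [W, -W] @ [p; q] = y.
    """
    k, d = W.shape
    c = np.ones(2 * d)                        # objective: sum(p) + sum(q)
    A_eq = np.hstack([W, -W])                 # W p - W q = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * d))
    assert res.success
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(4)
d, s, k = 200, 4, 60
x = np.zeros(d)
x[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)
W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d))
x_hat = l1_recover(W, W @ x)
print(np.max(np.abs(x_hat - x)))              # ~0: exact recovery w.h.p.
```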
Constructing RIP matrices

- Explicit constructions are not known.
- Instead, use an efficient random construction which is likely to work.

Theorem: Fix $\varepsilon, \delta \in (0, 1)$ and $s \in [d]$. Set
  $k \ge \frac{100\, s \log(40 d / (\delta \varepsilon))}{\varepsilon^2}$
and draw $W \in \mathbb{R}^{k \times d}$ via $W_{ij} \sim N(0, 1/k)$ independently. Then, with probability at least $1 - \delta$, the matrix $W$ is $(\varepsilon, s)$-RIP.

- Same random matrix as in J-L! (The required $k$ is different.)
- Full compressed sensing process (sketched below):
  1. Generate a random $W \in \mathbb{R}^{k \times d}$.
  2. Take the $k$ measurements $y = Wx$.
  3. Find $x := \operatorname{argmin}_{v \in \mathbb{R}^d : Wv = y} \|v\|_1$.
- We used $\tilde{O}(s \log(d)/\varepsilon^2)$ measurements instead of $d$.
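An end-to-end numpy/scipy sketch of the three-step process. Here $k$ is an illustrative choice rather than the theorem's bound (whose constant 100 is loose), and l1_recover repeats the LP sketch from the previous slide:

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(W, y):
    """argmin ||v||_1 s.t. Wv = y via LP (as in the previous sketch)."""
    k, d = W.shape
    res = linprog(np.ones(2 * d), A_eq=np.hstack([W, -W]), b_eq=y,
                  bounds=[(0, None)] * (2 * d))
    assert res.success
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(5)
d, s = 500, 3
k = 80   # illustrative; the theorem's constant 100 is loose

# 1. Generate the random measurement matrix (same distribution as in J-L).
W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d))

# 2. Take the k measurements of an s-sparse signal x.
x = np.zeros(d)
x[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)
y = W @ x

# 3. Recover by l1 minimization.
x_hat = l1_recover(W, y)
print(np.max(np.abs(x_hat - x)))   # ~0: exact recovery w.h.p.
```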
Latent sparsity

- Recall: $\|x\|_0 := |\{i \in [d] : x_i \neq 0\}|$, and if $\|x\|_0 \le s$ we can compress using $s$ (index, value) pairs.
- Sometimes the sparsity of $x$ is hidden: $x = U\alpha$, where $U \in \mathbb{R}^{d \times d}$ is orthogonal and $\|\alpha\|_0 \le s$.
- $U$ is fixed and assumed known: $x$ has a sparse representation in the basis $U$.
- This holds for many natural signals; e.g., JPEG (DCT basis) and JPEG 2000 (wavelet basis) image compression exploit such sparsity.
- Can we still compress $x$ using $O(s \log(d))$ measurements? Yes (sketched below):
  1. Find $W$ such that $W' := WU$ is RIP.
  2. Measure $y = Wx = WU\alpha = W'\alpha$.
  3. Compute $\alpha = \operatorname{argmin}_{v \in \mathbb{R}^d : W'v = y} \|v\|_0$.
  4. Compute $x = U\alpha$.
- How do we get $W$ such that $WU$ is RIP?
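A numpy sketch of this latent-sparsity pipeline. Here $U$ is a random orthogonal matrix for illustration, and the $\ell_0$ step is replaced by the $\ell_1$ LP from the earlier sketch (justified by the $\ell_1 = \ell_0$ theorem when $WU$ is sufficiently RIP):

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(W, y):
    """argmin ||v||_1 s.t. Wv = y via LP (as in the earlier sketch)."""
    k, d = W.shape
    res = linprog(np.ones(2 * d), A_eq=np.hstack([W, -W]), b_eq=y,
                  bounds=[(0, None)] * (2 * d))
    assert res.success
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(6)
d, s, k = 200, 3, 60

# A fixed, known orthogonal basis U (here: the Q-factor of a random matrix).
U, _ = np.linalg.qr(rng.normal(size=(d, d)))

# x = U @ alpha is dense, but its representation alpha is s-sparse.
alpha = np.zeros(d)
alpha[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)
x = U @ alpha

W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d))
y = W @ x                                   # = (W U) @ alpha

# Recover alpha using W' = W U, then map back to x.
alpha_hat = l1_recover(W @ U, y)
x_hat = U @ alpha_hat
print(np.max(np.abs(x_hat - x)))            # ~0: exact recovery w.h.p.
```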
Latent sparsity (cont.)

- How do we get $W \in \mathbb{R}^{k \times d}$ such that $W' = WU$ is RIP?
- Claim: if $W_{ij} \sim N(0, 1/k)$ independently, then so is $W'_{ij}$!
- Proof (distribution of each entry): $W'_{ij} = \sum_{t=1}^{d} W_{it} U_{tj} \sim N\left(0, \sum_{t=1}^{d} U_{tj}^2 / k\right) = N(0, 1/k)$, since the columns of $U$ are unit vectors.
- Independence:
  - Fact: jointly Gaussian zero-mean $X, Y$ are independent iff $E[XY] = 0$.
  - Check for distinct entries of $W'$:
    $E[W'_{ij} W'_{i'j'}] = E\left[\sum_{t,t'=1}^{d} W_{it} U_{tj} W_{i't'} U_{t'j'}\right] = \sum_{t,t'=1}^{d} U_{tj} U_{t'j'} E[W_{it} W_{i't'}] = 0$,
    since $E[W_{it} W_{i't'}] = \frac{1}{k}\mathbb{1}[i = i',\, t = t']$, so for $i \neq i'$ every term vanishes, and for $i = i'$, $j \neq j'$ the sum reduces to $\frac{1}{k} \sum_t U_{tj} U_{tj'} = 0$ by orthogonality of the columns of $U$.
- Conclusion: compressed sensing with latent sparsity can use the same $W$, regardless of $U$!
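A quick Monte Carlo sanity check of the claim (the sizes, trial count, and choice of entries to compare are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
d, k, trials = 50, 20, 2000
U, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a fixed orthogonal basis

# Sample many independent W, form W' = WU, and track two fixed entries.
a = np.empty(trials)
b = np.empty(trials)
for t in range(trials):
    Wp = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d)) @ U
    a[t], b[t] = Wp[0, 0], Wp[0, 1]

print(a.var(), 1.0 / k)    # the variance of W'_ij should be close to 1/k
print(np.mean(a * b))      # E[W'_{0,0} W'_{0,1}] should be close to 0
```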
Compressed Sensing: summary

- Assumes the data $x \in \mathbb{R}^d$ has an $s$-sparse representation in some basis $U \in \mathbb{R}^{d \times d}$.
- Goal: recover $x$ from fewer than $d$ measurements.
- Method: measure using an RIP matrix $W \in \mathbb{R}^{k \times d}$, with $k \propto s \log(d)$ where the representation is $s$-sparse ($\|\alpha\|_0 \le s$).
  - Take the $k$ measurements $Wx = WU\alpha$.
  - Recover the original $\alpha$ efficiently using convex optimization.
  - Output $x = U\alpha$.
- RIP matrices $W$ can be efficiently constructed:
  - The construction uses i.i.d. Gaussian entries.
  - The construction produces an RIP matrix with high probability.