Compressive Distilled Sensing: Sparse Recovery Using Adaptivity in Compressive Measurements

Jarvis D. Haupt¹, Richard G. Baraniuk¹, Rui M. Castro², and Robert D. Nowak³
¹Dept. of Electrical and Computer Engineering, Rice University, Houston, TX 77005
²Dept. of Electrical Engineering, Columbia University, New York, NY 10027
³Dept. of Electrical and Computer Engineering, University of Wisconsin, Madison, WI 53706

Abstract
The recently-proposed theory of distilled sensing establishes that adaptivity in sampling can dramatically improve the performance of sparse recovery in noisy settings. In particular, it is now known that adaptive point sampling enables the detection and/or support recovery of sparse signals that are otherwise too weak to be recovered using any method based on non-adaptive point sampling. In this paper the theory of distilled sensing is extended to highly-undersampled regimes, as in compressive sensing. A simple adaptive sampling-and-refinement procedure called compressive distilled sensing is proposed, where each step of the procedure utilizes information from previous observations to focus subsequent measurements into the proper signal subspace, resulting in a significant improvement in effective measurement SNR on the signal subspace. As a result, for the same budget of sensing resources, compressive distilled sensing can result in significantly improved error bounds compared to those for traditional compressive sensing.

I. INTRODUCTION

Let $x \in \mathbb{R}^n$ be a sparse vector supported on the set $S = \{i : x_i \neq 0\}$, where $|S| = s \ll n$, and consider observing $x$ according to the linear observation model

$$y = Ax + w, \tag{1}$$

where $A$ is an $m \times n$ real-valued matrix (possibly random) that satisfies $E[\|A\|_F^2] \le n$, and where $w_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$ for some $\sigma \ge 0$. This model is central to the emerging field of compressive sensing (CS), which deals primarily with recovery of $x$ in highly-underdetermined settings, that is, where the number of measurements $m \ll n$.
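To make the normalization concrete: for the common choice of $A$ with i.i.d. $\mathcal{N}(0, 1/m)$ entries, $E[\|A\|_F^2] = mn \cdot (1/m) = n$, so the sensing-energy constraint holds with equality in expectation. A quick numerical sanity check of model (1); the dimensions here are arbitrary choices for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 100, 400, 0.1

# A with i.i.d. N(0, 1/m) entries: E[||A||_F^2] = m*n*(1/m) = n.
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
fro = np.sum(A**2)
print(fro)           # concentrates around n = 400

# Observation model (1): y = A x + w with w_i i.i.d. N(0, sigma^2).
x = np.zeros(n)
x[:3] = 1.0          # a 3-sparse signal
y = A @ x + rng.normal(0.0, sigma, size=m)
print(y.shape)       # (100,) -- only m << n measurements
```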
Initial results in CS establish a rather surprising fact: using certain observation matrices $A$ for which the number of rows $m$ is a constant multiple of $s \log n$, it is possible to recover $x$ exactly from $\{y, A\}$, and in addition the recovery can be accomplished by solving a tractable convex optimization [1]–[3]. Matrices $A$ for which this exact recovery is possible are easy to construct in practice. For example, matrices whose entries are i.i.d. realizations of certain zero-mean distributions (Gaussian, symmetric Bernoulli, etc.) are sufficient to allow this recovery with high probability [2], [4].

In practice, however, it is rarely the case that observations are perfectly noise-free. In these settings, rather than attempt to recover $x$ exactly, the goal becomes to estimate $x$ to high accuracy in some metric, such as the $\ell_2$ norm [5], [6]. One such estimation procedure is the Dantzig selector, proposed in [6], which establishes that CS recovery remains stable in the presence of noise. We state the result here as a lemma.

Lemma 1 (Dantzig selector). For $m = \Omega(s \log n)$, generate a random $m \times n$ matrix $A$ whose entries are i.i.d. $\mathcal{N}(0, 1/m)$, and collect observations $y$ according to (1). The estimate

$$\hat{x} = \arg\min_{z \in \mathbb{R}^n} \|z\|_{\ell_1} \quad \text{subject to} \quad \|A^T(y - Az)\|_{\ell_\infty} < \lambda,$$

where $\lambda = \Theta(\sigma\sqrt{\log n})$, satisfies $\|\hat{x} - x\|_{\ell_2}^2 = O(s\sigma^2 \log n)$ with probability $1 - O(n^{-C_0})$ for some constant $C_0 > 0$.

Remark 1. The constants in the above can be specified explicitly or bounded appropriately, but we choose to present the results here, and where appropriate in the sequel, in terms of scaling relationships¹ in the interest of simplicity.

On the other hand, suppose that an oracle were to identify the locations of the nonzero signal components (or, equivalently, the support $S$) prior to recovery. Then one could construct the least-squares estimate $\hat{x}_{LS} = (A_S^T A_S)^{-1} A_S^T y$, where $A_S$ denotes the submatrix of $A$ formed from the columns indexed by the elements of $S$.

(This work was partially supported by ARO grant no. W911NF-09-1-0383, NSF grant no. CCF-0353079, and AFOSR grant no. FA9550-09-1-0140.)
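The oracle-assisted least-squares estimate is easy to illustrate numerically; the following sketch (dimensions and signal values are illustrative assumptions, not taken from the paper) forms $\hat{x}_{LS}$ on a known support:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s, sigma = 512, 128, 5, 0.1

# Sparse signal supported on S.
S = np.arange(s)
x = np.zeros(n)
x[S] = 1.0

# Observation model (1) with i.i.d. N(0, 1/m) entries.
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
y = A @ x + rng.normal(0.0, sigma, size=m)

# Oracle least-squares estimate using only the columns indexed by S.
A_S = A[:, S]
x_ls_S, *_ = np.linalg.lstsq(A_S, y, rcond=None)
x_ls = np.zeros(n)
x_ls[S] = x_ls_S

err = np.sum((x_ls - x) ** 2)
print(err)  # typically on the order of s*sigma^2 = 0.05, with no log n factor
```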
The error of this estimate is $\|\hat{x}_{LS} - x\|_{\ell_2}^2 = O(s\sigma^2)$ with probability $1 - O(n^{-C_1})$ for some $C_1 > 0$, as shown in [6]. Comparing this oracle-assisted bound with the result of Lemma 1, we see that the primary difference is the presence of the logarithmic term in the error bound of the latter, which can be interpreted as the searching penalty associated with having to learn the correct signal subspace.

Of course, the signal subspace will rarely, if ever, be known a priori. But suppose that it were possible to learn the signal subspace from the data, in a sequential and adaptive fashion, as the data are collected. In this case, sensing energy could be focused only into the true signal subspace, gradually improving the effective measurement SNR. Intuitively, one might expect that this type of procedure could ultimately yield an estimate whose accuracy would be closer to that of

¹Recall that for functions $f = f(n)$ and $g = g(n)$: $f = O(g)$ means $f \le cg$ for some constant $c$ for all $n$ sufficiently large; $f = \Omega(g)$ means $f \ge c'g$ for a constant $c'$ for all $n$ sufficiently large; and $f = \Theta(g)$ means that $f = O(g)$ and $f = \Omega(g)$. In addition, we will use the notation $f = o(g)$ to indicate that $\lim_{n\to\infty} f/g = 0$.
the oracle-assisted estimator, since the effective observation matrix would begin to assume the structure of $A_S$. Such adaptive compressive sampling methods have been proposed and examined empirically [7]–[9], but to date the performance benefits of adaptivity in compressive sampling have not been established theoretically. In this paper we take a step in that direction by analyzing the performance of a multi-step adaptive sampling-and-refinement procedure called compressive distilled sensing (CDS), extending our own prior work in distilled sensing, where the theoretical advantages of adaptive sampling in uncompressed settings were quantified [10], [11]. Our main results here guarantee that, for signals having not too many nonzero entries and for which the dynamic range is not too large, a total of $O(s \log n)$ adaptively-collected measurements yield an estimator that, with high probability, achieves the $O(s\sigma^2)$ error bound of the oracle-assisted estimator.

The remainder of the paper is organized as follows. The CDS procedure is described in Sec. II, and its performance is quantified as a theorem in Sec. III. Extensions and conclusions are briefly described in Sec. IV, and a sketch of the proof of the main result and associated lemmata appear in the Appendix.

Algorithm 1: Compressive distilled sensing (CDS)
  Input: Number of observation steps $k$; $R_j$, $j = 1, \dots, k$, such that $\sum_{j=1}^k R_j \le n$; $m_j$, $j = 1, \dots, k$, such that $\sum_{j=1}^k m_j \le m$;
  Initialize: Initial index set $I_1 = \{1, 2, \dots, n\}$;
  Distillation:
  for $j = 1$ to $k$ do
    Compute $\tau_j = R_j / |I_j|$;
    Construct $A^{(j)}$, where $A^{(j)}_{u,v} \overset{iid}{\sim} \mathcal{N}(0, \tau_j/m_j)$ for $u \in \{1, \dots, m_j\}$, $v \in I_j$, and $A^{(j)}_{u,v} = 0$ for $u \in \{1, \dots, m_j\}$, $v \notin I_j$;
    Collect $y^{(j)} = A^{(j)} x + w^{(j)}$;
    Compute $x^{(j)} = (A^{(j)})^T y^{(j)}$;
    Refine: $I_{j+1} = \{i \in I_j : x^{(j)}_i > 0\}$;
  end
  Output: Distilled observations $\{y^{(j)}, A^{(j)}\}_{j=1}^k$

II. COMPRESSIVE DISTILLED SENSING

In this section we describe the compressive distilled sensing (CDS) procedure, which is a natural generalization of the distilled sensing (DS) procedure [10], [11].
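The steps of Algorithm 1 can be sketched in a few lines of NumPy. The parameter choices below (equal per-step budgets, signal amplitude, noise level) are illustrative assumptions only, not the allocations analyzed in Sec. III:

```python
import numpy as np

def cds(x, k, R, m, sigma, rng):
    """Compressive distilled sensing (Algorithm 1) for nonnegative signals.

    R[j] and m[j] are the per-step sensing-energy and measurement budgets,
    with sum(R) <= n and sum(m) <= the total measurement budget."""
    n = x.size
    I = np.arange(n)                      # I_1 = {1, ..., n}
    observations = []
    for j in range(k):
        tau = R[j] / I.size               # tau_j = R_j / |I_j|
        A = np.zeros((m[j], n))           # columns outside I_j stay zero
        A[:, I] = rng.normal(0.0, np.sqrt(tau / m[j]), size=(m[j], I.size))
        y = A @ x + rng.normal(0.0, sigma, size=m[j])
        x_hat = A.T @ y                   # back-projection estimate
        observations.append((y, A))
        I = I[x_hat[I] > 0]               # refine: keep positive back-projections
    return observations, I

rng = np.random.default_rng(1)
n, s = 1024, 2
x = np.zeros(n)
x[:s] = 10.0                              # strong positive spikes
k = 4
R = [n // k] * k                          # equal split (illustrative only)
m = [64] * k
obs, I_final = cds(x, k, R, m, sigma=0.01, rng=rng)
print(I_final.size)                       # far fewer than n candidate indices remain
```

In this high-SNR regime the true spikes are retained with very high probability, while roughly half of the off-support indices are discarded at each step, mirroring the behavior described in Sec. II.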
The CDS procedure, given in Algorithm 1, is an adaptive procedure comprised of an alternating sequence of sampling (or observation) steps and refinement (or distillation) steps, and for which the observations are subject to a global budget of sensing resources, or sensing energy, that effectively quantifies the average measurement precision. The key point is that the adaptive nature of the procedure allows for sensing resources to be allocated nonuniformly; in particular, proportionally more of the resources can be devoted to subspaces of interest as they are identified.

In the $j$th sampling step, for $j = 1, \dots, k$, we collect measurements only at locations of $x$ corresponding to indices in a set $I_j$, where $I_1 = \{1, \dots, n\}$ initially. The $j$th refinement step, for $j = 1, \dots, k-1$, identifies the set of locations $I_{j+1} \subseteq I_j$ for which the corresponding signal components are to be measured in step $j+1$. It is clear that, in order to leverage the benefit of adaptivity, the distillation step should have the property that $I_{j+1}$ contains most (or, ideally, all) of the indices in $I_j$ that correspond to true signal components. In addition, and perhaps more importantly, we also want the set $I_{j+1}$ to be significantly smaller than $I_j$, since in that case we can realize an SNR improvement from focusing our sensing resources into the appropriate subspace.

In the DS procedure examined in [10], [11], observations were in the form of noisy samples of $x$ at any location $i \in \{1, \dots, n\}$ at each step $j$. In that case, it was shown that a simple refinement operation (identifying all locations for which the corresponding observation exceeded a threshold) was sufficient to ensure that, with high probability, $I_{j+1}$ would contain most of the indices in $I_j$ corresponding to true signal components, but only about half of the remaining indices, even when the signal is very weak. On the other hand, here we utilize a compressive sensing observation model, where at each step the observations are in the form of a low-dimensional vector $y \in \mathbb{R}^m$ with $m \ll n$.
In an attempt to mimic the uncompressed case, here we propose a similar refinement step applied to the back-projection estimate $(A^{(j)})^T y^{(j)} = x^{(j)} \in \mathbb{R}^n$, which can essentially be thought of as one of many possible estimates (or reconstructions) of $x$ that can be obtained from $y^{(j)}$ and $A^{(j)}$. The results in the next section quantify the improvements that can be achieved using this approach.

III. MAIN RESULTS

To state our main results, we set the input parameters of Algorithm 1 as follows. Choose $\alpha \in (0, 1/3)$, let $b = (1-\alpha)/(1-2\alpha)$, and let $k = 1 + \lceil \log_b \log n \rceil$. Allocate sensing resources according to

$$R_j = \begin{cases} \alpha n \left(\dfrac{1-2\alpha}{1-\alpha}\right)^{j-1}, & j = 1, \dots, k-1, \\ \alpha n, & j = k, \end{cases}$$

and note that this allocation guarantees that $R_{j+1}/R_j > 1/2$ and $\sum_{j=1}^k R_j \le n$. The latter inequality ensures that the total sensing energy does not exceed the total sensing energy used in conventional CS. The number of measurements acquired in each step is

$$m_j = \begin{cases} \rho_0\, s \log n/(k-1), & j = 1, \dots, k-1, \\ \rho_1\, s \log n, & j = k, \end{cases}$$

for some constants $\rho_0$ (which depends on the dynamic range) and $\rho_1$, each sufficiently large so that the results of Lemma 1 hold. Note that $m = O(s \log n)$, the same order as the minimum number of measurements required by conventional CS.
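The allocation above can be checked numerically. This sketch (with illustrative values of $\alpha$, $s$, and $n$, and placeholder constants $\rho_0$, $\rho_1$) verifies the resource constraint and the ratio condition:

```python
import math

def cds_parameters(n, s, alpha, rho0=1.0, rho1=1.0):
    """Input parameters for Algorithm 1 as specified in Sec. III.

    rho0 and rho1 are placeholder constants here; the analysis requires
    them sufficiently large (rho0 depending on the dynamic range)."""
    assert 0 < alpha < 1 / 3
    b = (1 - alpha) / (1 - 2 * alpha)
    k = 1 + math.ceil(math.log(math.log(n)) / math.log(b))
    R = [alpha * n * ((1 - 2 * alpha) / (1 - alpha)) ** (j - 1)
         for j in range(1, k)] + [alpha * n]
    m = [math.ceil(rho0 * s * math.log(n) / (k - 1))] * (k - 1) \
        + [math.ceil(rho1 * s * math.log(n))]
    return k, R, m

k, R, m = cds_parameters(n=10**6, s=100, alpha=0.2)
print(sum(R) <= 10**6)                           # total sensing energy within budget
print(all(R[j + 1] / R[j] > 0.5 for j in range(k - 1)))
```

Both checks print `True`: the geometric decay of the $R_j$ sums to at most $n(1-\alpha)$, and adding $R_k = \alpha n$ keeps the total within the budget $n$.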
Our main result, stated below and proved in the Appendix, quantifies the error performance of one particular estimate obtained from adaptive observations collected using the CDS procedure.

Theorem 1. Assume that $x \in \mathbb{R}^n$ is sparse with $s = n^{\beta/\log\log n}$ for some constant $0 < \beta < 1$. Furthermore, assume that each nonzero component of $x$ satisfies $\sigma\mu \le x_i \le D\sigma\mu$ for some $\mu > 0$. Here $\sigma$ is the noise standard deviation, $D > 1$ quantifies the dynamic range of the signal, and $\mu^2$ is the SNR. Adaptively measure $x$ according to Algorithm 1, with the input parameters as specified above, and construct the estimator $\hat{x}_{CDS}$ by applying the Dantzig selector with $\lambda = \Theta(\sigma)$ to the output of the algorithm (i.e., with $A = A^{(k)}$ and $y = y^{(k)}$).

1) There exists $\mu_0 = \Omega(\sqrt{\log n/\log\log n})$ such that if $\mu \ge \mu_0$, then $\|\hat{x}_{CDS} - x\|_{\ell_2}^2 = O(s\sigma^2)$ with probability $1 - O(n^{-C_0/\log\log n})$ for some $C_0 > 0$.
2) There exists $\mu_1 = \Omega(\sqrt{\log\log\log n})$ such that if $\mu_1 \le \mu < \mu_0$, then $\|\hat{x}_{CDS} - x\|_{\ell_2}^2 = O(s\sigma^2)$ with probability $1 - O(e^{-C_1\mu^2})$ for some $C_1 > 0$.
3) If $\mu < \mu_1$, then $\|\hat{x}_{CDS} - x\|_{\ell_2}^2 = O(s\sigma^2 \log\log\log n)$ with probability $1 - O(n^{-C_2})$ for some $C_2 > 0$.

In words, when the SNR is sufficiently large, the estimate achieves the error performance of the oracle-assisted estimator, albeit with a slower (slightly sub-polynomial) rate of convergence. For a class of slightly weaker signals, the oracle-assisted error performance is still achieved, but with a rate of convergence that is inversely proportional to the SNR. Note that we may summarize the results of the theorem with the general claim $\|\hat{x}_{CDS} - x\|_{\ell_2}^2 = O(s\sigma^2 \log\log\log n)$ with probability $1 - o(1)$. It is worth pointing out that for many problems of practical interest the $\log\log\log n$ term can be negligible, whereas $\log n$ is not; for example, $\log\log\log(10^6) < 1$, but $\log(10^6) \approx 14$.

IV. EXTENSIONS AND CONCLUSIONS

Although the CDS procedure was specified under the assumption that the nonzero signal components were positive, it can easily be extended to signals having negative entries as well.
In that case, one could split the budget of sensing resources in half, executing the procedure once as written, and again replacing the refinement step by $I_{j+1} = \{i \in I_j : x^{(j)}_i < 0\}$. In addition, the results presented here also apply if the signal is sparse in another basis. To implement the procedure in that case, one would generate the $A^{(j)}$ as above, but observations of $x$ would be obtained using $A^{(j)} T$, where $T \in \mathbb{R}^{n \times n}$ is an appropriate orthonormal transformation matrix (a discrete wavelet or cosine transform, for example). In either case the qualitative behavior is the same: observations are collected by projecting $x$ onto a superposition of basis elements from the appropriate basis.

We have shown here that the compressive distilled sensing procedure can significantly improve the theoretical performance of compressive sensing. In experiments (not shown here due to space limitations) we have found that CDS can perform significantly better than CS in practice, like similar previously-proposed adaptive methods [7]–[9]. We remark that our theoretical analysis shows that CDS is sensitive to the dynamic range of the signal. This is an artifact of the method for obtaining the signal estimate $x^{(j)}$ at each step. As alluded to at the end of Section II, $x^{(j)}$ could be obtained using any of a number of methods, including, for example, Dantzig selector estimation with a smaller value of $\lambda$, or other mixed-norm reconstruction techniques such as the LASSO with sufficiently small regularization parameters. Such extensions will be explored in future work.

V. APPENDIX

A. Lemmata

We first establish several key lemmata that will be used in the sketch of the proof of the main result. In particular, the first two results presented below quantify the effects of each refinement step.

Lemma 2. Let $x \in \mathbb{R}^n$ be supported on $S$ with $|S| = s$, and let $x_S$ denote the subvector of $x$ composed of the entries of $x$ whose indices are in $S$. Let $A$ be an $m \times n$ matrix whose entries are i.i.d.
$\mathcal{N}(0, \tau/m)$ for some $0 < \tau_{\min} \le \tau$, and let $A_S$ and $A_{S^c}$ be submatrices of $A$ composed of the columns of $A$ corresponding to the indices in the sets $S$ and $S^c$, respectively. Let $w \in \mathbb{R}^m$ be independent of $A$ and have i.i.d. $\mathcal{N}(0, \sigma^2)$ entries. For the $z \times 1$ vector $U = A_{S^c}^T (A_S x_S + w)$, where $z = |S^c| = n - s$, we have

$$(1/2 - \epsilon)\, z \le \sum_{i=1}^{z} \mathbf{1}_{\{U_i > 0\}} \le (1/2 + \epsilon)\, z$$

for any $\epsilon \in (0, 1/2)$, with probability at least $1 - 2\exp(-2\epsilon^2 z)$.

Proof: Define $Y = Ax + w = A_S x_S + w$, and note that, given $Y$, the entries of $U = A_{S^c}^T Y$ are i.i.d. $\mathcal{N}(0, \|Y\|_{\ell_2}^2 \tau/m)$. Thus, when $Y \ne 0$, we have $\Pr(U_i > 0) = 1/2$ for all $i = 1, \dots, z$. Let $T_i = \mathbf{1}_{\{U_i > 0\}}$ and apply Hoeffding's inequality to obtain that, for any $\epsilon \in (0, 1/2)$,

$$\Pr\left( \Big| \sum_i T_i - \frac{z}{2} \Big| > \epsilon z \;\Big|\; Y \right) \le 2\exp(-2\epsilon^2 z) \quad \text{for } Y \ne 0.$$

Now we integrate to obtain

$$\Pr\left( \Big| \sum_i T_i - \frac{z}{2} \Big| > \epsilon z \right) \le \int_{Y : Y \ne 0} 2\exp(-2\epsilon^2 z)\, dP_Y + \int_{Y : Y = 0} 1\, dP_Y \le 2\exp(-2\epsilon^2 z).$$

The last step follows from the fact that the event $Y = 0$ has probability zero, since $Y$ is Gaussian-distributed.

Lemma 3. Let $x$, $S$, $x_S$, $A$, $A_S$, and $w$ be as defined in the previous lemma. Assume further that the entries of $x$ satisfy $\sigma\mu \le x_i \le D\sigma\mu$ for $i \in S$, for some $\mu > 0$ and fixed $D > 1$. Define

$$\rho = \exp\left( -\frac{m}{32\left(sD^2 + m/(\tau_{\min}\mu^2)\right)} \right) < 1;$$

then, for the $s \times 1$ vector $V = A_S^T(A_S x_S + w)$, either of the following bounds is valid:

$$\Pr\left( \sum_{i=1}^{s} \mathbf{1}_{\{V_i > 0\}} \ne s \right) \le 2s\rho^2,$$
or

$$\Pr\left( \sum_{i=1}^{s} \mathbf{1}_{\{V_i > 0\}} < s(1 - 3\rho) \right) \le 4\rho.$$

Proof: Given $A_i$, the $i$th column of $A$, we have

$$V_i \sim \mathcal{N}\left( \|A_i\|_{\ell_2}^2 x_i,\; \|A_i\|_{\ell_2}^2 \Big( \frac{\tau}{m} \sum_{j=1, j \ne i}^{s} x_j^2 + \sigma^2 \Big) \right),$$

and so, by a standard Gaussian tail bound,

$$\Pr(V_i \le 0 \mid A_i) = \Pr\left( \mathcal{N}(0,1) > \frac{\|A_i\|_{\ell_2}\, x_i}{\sqrt{\frac{\tau}{m}\sum_{j \ne i} x_j^2 + \sigma^2}} \right) \le \exp\left( -\frac{\|A_i\|_{\ell_2}^2\, x_i^2}{2\left(\frac{\tau}{m}\|x\|^2 + \sigma^2\right)} \right).$$

Now we can leverage a result on the tails of a chi-squared random variable from [12] to obtain that, for any $\gamma \in (0,1)$, $\Pr\left( \|A_i\|^2 \le (1-\gamma)\tau \right) \le \exp(-\gamma^2 m/4)$. Again we employ conditioning to obtain

$$\Pr(V_i \le 0) \le \int_{A_i : \|A_i\|^2 \le (1-\gamma)\tau} 1\, dP_{A_i} + \int_{A_i : \|A_i\|^2 > (1-\gamma)\tau} \Pr(V_i \le 0 \mid A_i)\, dP_{A_i} \le \exp\left(-\frac{\gamma^2 m}{4}\right) + \exp\left( -\frac{\tau(1-\gamma)\mu^2}{2\left(\tau s D^2 \mu^2/m + 1\right)} \right),$$

where the last bound follows from the conditions on the $x_i$. Now, to simplify, we choose $\gamma = \gamma^* \in (0,1)$ to balance the two terms, obtaining

$$\gamma^* = \left( sD^2 + \frac{m}{\tau\mu^2} \right)^{-1} \left( \sqrt{1 + 2\left(sD^2 + \frac{m}{\tau\mu^2}\right)} - 1 \right).$$

Using the fact that $(\sqrt{1+2t} - 1)/t > 1/\sqrt{2t}$ for $t > 1$, we can conclude that

$$\gamma^* > \frac{1}{\sqrt{2}} \left( sD^2 + \frac{m}{\tau\mu^2} \right)^{-1/2},$$

since $s \ge 1$ and $D > 1$ by assumption. Now, using the fact that $\tau \ge \tau_{\min}$, we have that $\Pr(V_i \le 0) \le 2\rho^2$, where $\rho = \exp\left( -m/\left(32\left(sD^2 + m/(\tau_{\min}\mu^2)\right)\right) \right)$ as above. The first result follows from

$$\Pr\left( \sum_{i=1}^{s} \mathbf{1}_{\{V_i > 0\}} \ne s \right) = \Pr\left( \bigcup_{i=1}^{s} \{V_i \le 0\} \right) \le s \max_{i \in \{1,\dots,s\}} \Pr(V_i \le 0) \le 2s\rho^2.$$

For the second result, let us simplify notation by introducing the variables $T_i = \mathbf{1}_{\{V_i > 0\}}$ and $t_i = E[T_i]$. By Markov's inequality we have

$$\Pr\left( \Big| \sum_i T_i - \sum_i t_i \Big| > ps \right) \le (ps)^{-1} E\left[ \Big| \sum_i (T_i - t_i) \Big| \right] \le (ps)^{-1} \sum_i E\left[ |T_i - t_i| \right] \le p^{-1} \max_{i \in \{1,\dots,s\}} E\left[ |T_i - t_i| \right].$$

Now note that

$$|T_i - t_i| = \begin{cases} 1 - \Pr(V_i > 0), & V_i > 0, \\ \Pr(V_i > 0), & V_i \le 0, \end{cases}$$

and so $E[|T_i - t_i|] \le 2\Pr(V_i \le 0)$. Thus we have that $\max_{i} E[|T_i - t_i|] \le 2 \cdot 2\rho^2 = 4\rho^2$, and so

$$\Pr\left( \Big| \sum_i T_i - \sum_i t_i \Big| > ps \right) \le 4p^{-1}\rho^2.$$

Now let $p = \rho$ to obtain $\Pr\left( \sum_i T_i < \sum_i t_i - \rho s \right) \le 4\rho$. Since $t_i = 1 - \Pr(V_i \le 0)$, we have $\sum_i t_i \ge s(1 - 2\rho^2)$, and thus $\Pr\left( \sum_i T_i < s(1 - 2\rho^2 - \rho) \right) \le 4\rho$. The result follows from the fact that $2\rho^2 + \rho < 3\rho$.

Lemma 4. For $0 < p < 1$ and $q > 0$, we have $(1-p)^q \ge 1 - qp/(1-p)$.

Proof: We have

$$\log\left( (1-p)^q \right) = q \log(1-p) = -q \log\left( 1 + \frac{p}{1-p} \right) \ge -\frac{qp}{1-p},$$

where the last bound follows from the fact that $\log(1+t) \le t$ for $t \ge 0$. Thus $(1-p)^q \ge \exp(-qp/(1-p)) \ge 1 - qp/(1-p)$, the last bound following from the fact that $e^t \ge 1 + t$ for all $t \in \mathbb{R}$.

B. Sketch of Proof of Theorem 1

To establish the main results of the paper, we will first show that the final set of observations of the CDS procedure is, with high probability, equivalent in distribution to a set of observations of the form (1), but with different parameters: a smaller effective dimension $n_{\text{eff}}$ and effective noise power $\sigma_{\text{eff}}^2$, and for which some fraction of the original signal components may be absent. To that end, let $S_j = S \cap I_j$ and $Z_j = S^c \cap I_j$, for $j = 1, \dots, k$, denote the subsets of indices of $S$ and its complement, respectively, that remain to be measured in step $j$. Note that at each step of the procedure, the back-projection estimate $x^{(j)} = (A^{(j)})^T (A^{(j)} x + w^{(j)})$ can be decomposed into

$$x^{(j)}_{S_j} = (A^{(j)}_{S_j})^T A^{(j)}_{S_j} x_{S_j} + (A^{(j)}_{S_j})^T w^{(j)} \quad \text{and} \quad x^{(j)}_{Z_j} = (A^{(j)}_{Z_j})^T A^{(j)}_{S_j} x_{S_j} + (A^{(j)}_{Z_j})^T w^{(j)},$$

and these subvectors are precisely of the form specified in the conditions of Lemmas 2 and 3.
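The decomposition of the back-projection estimate is a simple algebraic identity, which can be verified numerically. The dimensions and step index below are arbitrary illustrative choices (with $I_j = \{1, \dots, n\}$ for simplicity):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, s = 64, 32, 4
S = np.arange(s)                 # support indices S_j
Z = np.arange(s, n)              # off-support indices Z_j

x = np.zeros(n)
x[S] = 1.0
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
w = rng.normal(0.0, 0.1, size=m)

# Back-projection estimate x^(j) = (A^(j))^T (A^(j) x + w^(j)).
x_bp = A.T @ (A @ x + w)

# Its subvectors decompose exactly as in the text, since A x = A_S x_S.
A_S, A_Z = A[:, S], A[:, Z]
lhs_S, rhs_S = x_bp[S], A_S.T @ (A_S @ x[S]) + A_S.T @ w
lhs_Z, rhs_Z = x_bp[Z], A_Z.T @ (A_S @ x[S]) + A_Z.T @ w
print(np.allclose(lhs_S, rhs_S) and np.allclose(lhs_Z, rhs_Z))  # True
```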
Let $z_j = |Z_j|$ and $s_j = |S_j|$, and in particular note that $s_1 = s$ and $z_1 = z = n - s$. Choose the parameters of the CDS algorithm as specified in Section III. Iteratively applying Lemma 2, we have that, for any fixed $\epsilon \in (0, 1/2)$, the bounds

$$(1/2 - \epsilon)^{j-1} z \le z_j \le (1/2 + \epsilon)^{j-1} z$$

hold simultaneously for all $j = 1, 2, \dots, k$ with probability at least $1 - 2(k-1)\exp\left(-2z\epsilon^2(1/2-\epsilon)^{k-2}\right)$, which is no less than $1 - O\left(\exp(-c_0\, n/\log^{c_1} n)\right)$ for some constants $c_0 > 0$ and $c_1 > 0$, for $n$ sufficiently large². As a result, with the same probability, the total number of locations in the set $I_j$ satisfies $|I_j| \le s_1 + z_1(1/2 + \epsilon)^{j-1}$ for all $j = 1, 2, \dots, k$. Thus we can lower bound $\tau_j = R_j/|I_j|$ at each step by

$$\tau_j \ge \begin{cases} \dfrac{\alpha n \left(\frac{1-2\alpha}{1-\alpha}\right)^{j-1}}{s + z\left(\frac{1+2\epsilon}{2}\right)^{j-1}}, & j = 1, \dots, k-1, \\[2ex] \dfrac{\alpha n}{s + z\left(\frac{1+2\epsilon}{2}\right)^{k-1}}, & j = k. \end{cases}$$

Now note that, when $n$ is sufficiently large³, we have $s \le z(1/2 + \epsilon)^{j-1}$ holding for all $j = 1, \dots, k$. Letting $\epsilon = (1-3\alpha)/(2(1-\alpha))$, we can simplify the bounds on $\tau_j$ to obtain $\tau_j \ge \alpha/2$ for $j = 1, \dots, k-1$, and $\tau_k \ge \alpha \log n/2$.

The salient point to note here is the value of $\tau_k$, and in particular its dependence on the signal dimension $n$. This essentially follows from the fact that the set of indices to measure decreases by a fixed factor with each distillation step, and so after $O(\log\log n)$ steps the number of indices to measure is smaller than in the initial step by a factor of about $\log n$. Thus, for the same allocation of resources ($R_1 = R_k$), the SNR of the final set of observations is larger than that of the first set by a factor of $\log n$.

Now, the final set of observations is $y^{(k)} = A^{(k)} x^{(k)} + w^{(k)}$, where $x^{(k)} \in \mathbb{R}^{n_{\text{eff}}}$ for some $n_{\text{eff}} < n$ is supported on the set $S_k = S \cap I_k$, $A^{(k)}$ is an $m_k \times n_{\text{eff}}$ matrix, and the $w_i$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. We can divide throughout by $\sqrt{\tau_k}$ to obtain the equivalent statement $\tilde{y} = \tilde{A} x^{(k)} + \tilde{w}$, where now the entries of $\tilde{A}$ are i.i.d. $\mathcal{N}(0, 1/m_k)$ and the $\tilde{w}_i$ are i.i.d. $\mathcal{N}(0, \tilde{\sigma}^2)$, where $\tilde{\sigma}^2 \le 2\sigma^2/(\alpha \log n)$.

To bound the overall squared error, we consider the variance associated with estimating the components of $x$ using the Dantzig selector (cf.
Lemma 1), as well as the squared bias arising from the fact that some signal components may not be present in the final support set $S_k$. In particular, letting $\tilde{x}$ denote the vector that agrees with $x$ on $S_k$ and is zero elsewhere, a bound for the overall error is given by

$$\|\hat{x} - x\|_{\ell_2}^2 = \|\hat{x} - \tilde{x} + \tilde{x} - x\|_{\ell_2}^2 \le 2\|\hat{x} - \tilde{x}\|_{\ell_2}^2 + 2\|\tilde{x} - x\|_{\ell_2}^2.$$

We can bound the first term by applying the result of Lemma 1 to obtain that, for $\rho_1$ sufficiently large, $\|\hat{x} - \tilde{x}\|_{\ell_2}^2 = O(s\sigma^2)$ holds with probability $1 - O(n^{-C_0})$ for some $C_0 > 0$. Now let $\delta = |S \setminus S_k|/s$ denote the fraction of true signal components that are rejected by the CDS procedure. Then we have $\|\tilde{x} - x\|_{\ell_2}^2 = O(s\sigma^2\delta\mu^2)$, and so overall we have $\|\hat{x} - x\|_{\ell_2}^2 = O(s\sigma^2 + s\sigma^2\delta\mu^2)$ with probability $1 - O(n^{-C_0})$. The method for bounding the second term in the error bound varies depending on the signal amplitude $\mu$; we consider three cases below.

1) $\mu \ge 8D\sqrt{3/\alpha}\,\sqrt{\log n/\log\log n}$: Conditioned on the event that the stated lower bounds for $\tau_j$ are valid, we can iteratively apply Lemma 3, taking $\tau_{\min} = \alpha/2$. For $\rho_0 = 96D^2/\log b$, where $b$ is the parameter from the expression for $k$, let $m_j = \rho_0\, s \log n/\lceil \log_b \log n \rceil$. Then we obtain that, for all $n$ sufficiently large, $\delta = 0$ with probability at least $1 - O(n^{-C_0/\log\log n})$ for some constant $C_0 > 0$. Since this term governs the rate, we have overall that $\|\hat{x} - x\|_{\ell_2}^2 = O(s\sigma^2)$ holds with probability $1 - O(n^{-C_0/\log\log n})$, as claimed.

2) $16\sqrt{2/(\alpha \log b)}\,\sqrt{\log\log\log n} \le \mu < 8D\sqrt{3/\alpha}\,\sqrt{\log n/\log\log n}$: For this range of signal amplitudes we will need to control $\delta$ explicitly. Conditioned on the event that the lower bounds for $\tau_j$ hold, we iteratively apply Lemma 3, where, for $\rho_0 = 96D^2/\log b$, we let $m_j = \rho_0\, s \log n/\lceil \log_b \log n \rceil$. Now we invoke Lemma 4 to obtain that, for $n$ sufficiently large, $\delta \le 1 - (1 - 3\rho)^{k-1} = O(e^{-C_1\mu^2})$ with probability at least $1 - O(e^{-C_1\mu^2})$ for some $C_1 > 0$. It follows that $\delta\mu^2$ is $O(1)$, and so overall $\|\hat{x} - x\|_{\ell_2}^2 = O(s\sigma^2)$ with probability $1 - O(e^{-C_1\mu^2})$.

²In particular, we require $n \ge c_0(\log\log\log n)(\log n)^{c_1}\left(1 - n^{-c_2/\log\log n}\right)^{-1}$, where $c_0$, $c_1$, and $c_2$ are positive functions of $\epsilon$ and $\beta$.
³In particular, we require $n \ge \left(1 + \log n \log\log n\right) n^{\beta/\log\log n}$.
3) $\mu < 16\sqrt{2/(\alpha \log b)}\,\sqrt{\log\log\log n}$: Invoking the trivial bound $\delta \le 1$, it follows from the above that, for $n$ sufficiently large, the error satisfies $\|\hat{x} - x\|_{\ell_2}^2 = O(s\sigma^2 \log\log\log n)$ with probability $1 - O(n^{-C_2})$ for some constant $C_2 > 0$, as claimed.

REFERENCES

[1] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[2] D. L. Donoho, "Compressed sensing," IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[3] E. J. Candès and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" IEEE Trans. Inform. Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006.
[4] R. Baraniuk, M. Davenport, R. A. DeVore, and M. Wakin, "A simple proof of the restricted isometry property for random matrices," Constructive Approximation, 2008.
[5] J. Haupt and R. Nowak, "Signal reconstruction from noisy random projections," IEEE Trans. Inform. Theory, vol. 52, no. 9, pp. 4036–4048, Sept. 2006.
[6] E. J. Candès and T. Tao, "The Dantzig selector: Statistical estimation when p is much larger than n," Ann. Statist., vol. 35, no. 6, pp. 2313–2351, Dec. 2007.
[7] S. Ji, Y. Xue, and L. Carin, "Bayesian compressive sensing," IEEE Trans. Signal Processing, vol. 56, no. 6, pp. 2346–2356, June 2008.
[8] R. Castro, J. Haupt, R. Nowak, and G. Raz, "Finding needles in noisy haystacks," in Proc. IEEE Conf. Acoustics, Speech and Signal Proc., Honolulu, HI, Apr. 2008, pp. 5133–5136.
[9] J. Haupt, R. Castro, and R. Nowak, "Adaptive sensing for sparse signal recovery," in Proc. IEEE 13th Digital Sig. Proc. / 5th Sig. Proc. Education Workshop, Marco Island, FL, Jan. 2009, pp. 702–707.
[10] J. Haupt, R. Castro, and R. Nowak, "Adaptive discovery of sparse signals in noise," in Proc. 42nd Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, CA, Oct. 2008, pp. 1727–1731.
[11] J. Haupt, R. Castro, and R. Nowak, "Distilled sensing: Selective sampling for sparse signal recovery," in Proc.
12th International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, FL, Apr. 2009, pp. 216–223.
[12] B. Laurent and P. Massart, "Adaptive estimation of a quadratic functional by model selection," Ann. Statist., vol. 28, no. 5, pp. 1302–1338, Oct. 2000.