The Dantzig Selector Emmanuel Candès California Institute of Technology Statistics Colloquium, Stanford University, November 2006 Collaborator: Terence Tao (UCLA)
Statistical linear model: y = Xβ + z
y ∈ R^n: vector of observations (available to the statistician)
X ∈ R^{n×p}: data matrix (known)
β ∈ R^p: object of interest (vector of parameters)
z ∈ R^n: stochastic error; here, z_i i.i.d. N(0, σ²)
Goal: estimate β from y (not Xβ)
Classical setup
Number of parameters p fixed
Number of observations n very large
Asymptotics as n → ∞
Modern setup
About as many parameters as observations: n ≈ p
Many more parameters than observations: n < p or n ≪ p
Very important subject in statistics today: high-dimensional data. Conferences, hundreds of articles.
Example of interest: MRI angiography (sketch)
β: N × N image of interest ⇒ p = N²
y: noisy Fourier coefficients of the image, sampled along 22 radial lines
- 4% coverage for a 512 × 512 image: n = .04p
- 2% coverage for a 1024 × 1024 image: n = .02p
Examples of the modern setup
Biomedical imaging: fewer measurements than pixels
Analog-to-digital conversion (ADCs)
Inverse problems
Genomics: low number of observations (in the tens); total number of genes assayed (and considered as possible regressors) easily in the thousands
Curve/surface estimation: recovery of a continuous-time curve from a finite number of noisy samples
Underdetermined systems
Fundamental difficulty: fewer equations than unknowns
Noiseless setup, p > n: y = Xβ
Seems highly problematic
Sparsity as a way out?
What if β is sparse or nearly sparse? Perhaps more observations than unknowns.
Warning: e.g. p = 2n = 20 with X = [I 0] (the 10 × 10 identity padded with zero columns), so that y_1 = β_1, ..., y_10 = β_10.
Is it just smoke? What about β_11, β_12, ..., β_20?
Encouraging example, I
Reconstruct by solving
min_b ‖b‖_ℓ1 := Σ_i |b_i| s.t. Xb = y = Xβ
β: 15 nonzeros, n = 30 Fourier samples ⇒ perfect recovery
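The ℓ1 program above can be posed as a linear program and handed to any LP solver. Below is a minimal sketch in Python using scipy.optimize.linprog; the Gaussian random design, the sizes, and the sparsity level are illustrative stand-ins for the Fourier-sampling example.

```python
# Noiseless l1 recovery:  min ||b||_1  subject to  Xb = y,
# recast as an LP over variables (b, u) with |b_i| <= u_i.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, S = 40, 100, 4                       # illustrative sizes

X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = np.zeros(p)
support = rng.choice(p, S, replace=False)
beta[support] = 3 * rng.standard_normal(S)
y = X @ beta                               # noiseless observations

c = np.concatenate([np.zeros(p), np.ones(p)])   # minimize sum(u)
I = np.eye(p)
A_ub = np.block([[I, -I], [-I, -I]])       # b - u <= 0, -b - u <= 0
b_ub = np.zeros(2 * p)
A_eq = np.hstack([X, np.zeros((n, p))])    # Xb = y
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * p + [(0, None)] * p)
b_hat = res.x[:p]
```

With n well above the sparsity level, the ℓ1 minimizer coincides with β (exact recovery), mirroring the Fourier example on the slide.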
Encouraging example, II
Reconstruct β with
min_b ‖b‖_TV := Σ_{i1,i2} |∇b(i1, i2)| s.t. Xb = y = Xβ, b ∈ R^p
[Figure: original phantom (Shepp-Logan) | filtered backprojection (naive reconstruction) | min TV + nonnegativity constraint ⇒ perfect recovery]
A theory behind these results
Consider the underdetermined system y = Xβ where β is S-sparse (only S of the coordinates of β are nonzero).
Theorem. Let β̂ = argmin_{b ∈ R^p} ‖b‖_ℓ1 subject to Xb = y. Then perfect reconstruction β̂ = β, provided X obeys a uniform uncertainty principle. (Tomorrow's talk)
Statistical estimation In real applications, data are corrupted Sensing model: y = Xβ + z Hopeless? (most of the singular values of X are zero)
The Dantzig selector
Assume the columns of X are unit-normed and y = Xβ + z, z_i i.i.d. N(0, σ²).
Dantzig selector (DS):
β̂ = argmin_b ‖b‖_ℓ1 s.t. |(X^T r)_i| ≤ λ_p σ for all i
where r is the vector of residuals, r = y − Xb, and λ_p = √(2 log p) (makes the true β feasible with large probability).
Why l 1? Theory of noiseless case Precedents in Statistics
Why X^T r small (and not r)?
‖X^T r‖_ℓ∞ ≤ λσ
Invariance: imagine rotating the data y = Xβ + z by an orthogonal U: Uy = UXβ + Uz, i.e. ỹ = X̃β + z̃. The estimator should not change, and the DS is invariant.
Correlations: X^T r measures the correlation of the residuals with the predictors. Don't leave the jth predictor out when |⟨r, X_j⟩| is large! It is not the size of r that matters but that of X^T r; e.g. y = X_j and σ > ‖X_j‖_ℓ∞.
The Dantzig selector as a linear program
min Σ_i |b_i| s.t. |(X^T(y − Xb))_j| ≤ λ_p σ
can be recast as a linear program (LP):
min Σ_i u_i s.t. −u ≤ b ≤ u, −λ_p σ · 1 ≤ X^T(y − Xb) ≤ λ_p σ · 1, u, b ∈ R^p
(a ≤ b means a_i ≤ b_i for all i)
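A minimal sketch of this LP recast using scipy.optimize.linprog. The random unit-normed design, the sparsity level, and the noise level below are illustrative assumptions, not values from the talk.

```python
# Dantzig selector as an LP over variables (b, u):
#   min sum(u)  s.t.  -u <= b <= u,  |X^T (y - Xb)| <= lambda_p * sigma.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p, S, sigma = 60, 120, 3, 0.05          # illustrative sizes
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)             # unit-normed columns
beta = np.zeros(p)
beta[rng.choice(p, S, replace=False)] = 2.0
y = X @ beta + sigma * rng.standard_normal(n)

lam = np.sqrt(2 * np.log(p))               # lambda_p
delta = lam * sigma                        # constraint level

G = X.T @ X
Xty = X.T @ y
I = np.eye(p)
Z = np.zeros((p, p))
A_ub = np.block([[I, -I], [-I, -I], [-G, Z], [G, Z]])
b_ub = np.concatenate([np.zeros(2 * p),
                       delta - Xty,        #  X^T(y - Xb) <= delta
                       delta + Xty])       # -X^T(y - Xb) <= delta
c = np.concatenate([np.zeros(p), np.ones(p)])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * p + [(0, None)] * p)
b_hat = res.x[:p]
```

The solution satisfies the residual-correlation constraint by construction and lands close to β when the design is incoherent enough.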
Related approaches
Lasso (Tibshirani, 96): min ‖y − Xb‖²_ℓ2 s.t. ‖b‖_ℓ1 ≤ t
Basis pursuit denoising (Chen, Donoho, Saunders, 96): min ½ ‖y − Xb‖²_ℓ2 + λ_p σ ‖b‖_ℓ1
The Dantzig selector is different.
The Dantzig selector works!
Assume β is S-sparse and X satisfies the condition δ_2S + θ_{S,2S} < 1. Then with large probability
‖β̂ − β‖²_ℓ2 ≤ O(2 log p) · S · σ²
No blow up!
σ: nicely degrades as the noise level increases
S: MSE is proportional to the true (unknown) number of parameters (information content) ⇒ adaptivity
2 log p: price to pay for not knowing where the significant coordinates are
Understanding the condition: restricted isometries
X_M ∈ R^{n×|M|}: columns of X with indices in M.
Restricted isometry constant δ_S: for each S, δ_S ∈ R is the smallest number s.t.
(1 − δ_S) Id ≤ X_M^T X_M ≤ (1 + δ_S) Id for all M with |M| ≤ S.
Sparse subsets of column vectors are approximately orthonormal.
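Computing δ_S exactly is combinatorial, but one can get a crude Monte Carlo lower bound by sampling subsets M and examining the eigenvalues of X_M^T X_M. A sketch under an illustrative random Gaussian design (sizes assumed):

```python
# Monte Carlo lower bound on the restricted isometry constant delta_S:
# sample column subsets M with |M| = S and track how far the eigenvalues
# of X_M^T X_M stray from 1. (The true delta_S maximizes over all M.)
import numpy as np

rng = np.random.default_rng(2)
n, p, S = 200, 400, 5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)             # unit-normed columns

delta_est = 0.0
for _ in range(200):
    M = rng.choice(p, S, replace=False)
    eig = np.linalg.eigvalsh(X[:, M].T @ X[:, M])
    delta_est = max(delta_est, float(np.max(np.abs(eig - 1))))
print(f"estimated delta_{S} >= {delta_est:.3f}")
```

For a random Gaussian design of this size the sampled bound stays well below 1, consistent with sparse column subsets being approximately orthonormal.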
Understanding the condition: the number θ
θ_{S,S'} is the smallest quantity for which
|⟨Xb, Xb'⟩| ≤ θ_{S,S'} ‖b‖_ℓ2 ‖b'‖_ℓ2
for all b, b' supported on disjoint subsets M, M' ⊆ {1, ..., p} with |M| ≤ S, |M'| ≤ S'.
(Geometrically, θ = cos φ, where φ is the angle between the two spans.)
The condition is almost necessary
Uniform uncertainty principle: to estimate an S-sparse vector β we need X s.t. δ_2S + θ_{S,2S} < 1.
With δ_2S = 1, one can have h ∈ R^p, 2S-sparse, with Xh = 0. Decompose h = β − β', each S-sparse, with X(β − β') = 0, i.e. Xβ = Xβ': the model is not identifiable!
Comparing with an oracle
The oracle informs us of the support M of the unknown β.
Ideal LS estimate: β̂_LS = argmin_b ‖y − X_M b‖²_ℓ2, i.e. β̂_LS = (X_M^T X_M)^{-1} X_M^T y. Ideal because unachievable.
MSE: E‖β̂_LS − β‖²_ℓ2 = Tr((X_M^T X_M)^{-1}) σ² ≈ S σ²
With large probability, the DS obeys ‖β̂ − β‖²_ℓ2 = O(log p) · E‖β̂_LS − β‖²_ℓ2
Is this good enough? A more powerful oracle
This oracle lets us observe all the coordinates of the unknown vector β directly, y_i = β_i + z_i, z_i i.i.d. N(0, σ²), and tells us which coordinates are above the noise level.
Ideal shrinkage estimator (Donoho and Johnstone, 94):
β̂^I_i = y_i if |β_i| ≥ σ, and β̂^I_i = 0 if |β_i| < σ
MSE of the ideal estimator: E‖β̂^I − β‖²_ℓ2 = Σ_i E(β̂^I_i − β_i)² = Σ_i min(β_i², σ²)
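The risk formula above is easy to sanity-check numerically. A small Monte Carlo sketch of the ideal shrinkage estimator; the vector β and the noise level below are illustrative.

```python
# Ideal (oracle) shrinkage: keep y_i when |beta_i| >= sigma, zero it out
# otherwise. Compare the empirical MSE to sum_i min(beta_i^2, sigma^2).
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
beta = np.array([5.0, 2.0, 0.8, 0.3, 0.0, -1.5, 0.1, 3.0])
keep = np.abs(beta) >= sigma               # oracle knowledge

trials, mse = 20000, 0.0
for _ in range(trials):
    y = beta + sigma * rng.standard_normal(beta.size)
    b_ideal = np.where(keep, y, 0.0)       # ideal shrinkage estimate
    mse += float(np.sum((b_ideal - beta) ** 2))
mse /= trials

ideal_risk = float(np.sum(np.minimum(beta ** 2, sigma ** 2)))
```

The empirical MSE concentrates around the ideal risk: each kept coordinate contributes σ² (pure noise), each discarded one contributes β_i².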
Main claim: oracle inequality
Same assumptions as before. Then with large probability
‖β̂ − β‖²_ℓ2 ≤ O(2 log p) · (σ² + Σ_i min(β_i², σ²)).
The DS does not have any oracle information: fewer observations than variables, and direct observations of β are not available. Yet it nearly achieves the ideal MSE! An unexpected feat.
This is optimal: no estimator can do essentially better. Extensions to compressible signals. Other work: Haupt and Nowak (05).
Another oracle-style benchmark: ideal model selection
Subset M ⊆ {1, ..., p}. Span of M: V_M := {b ∈ R^p : b_i = 0 ∀ i ∉ M}.
Least squares estimate: β̂_M = argmin_{b ∈ V_M} ‖y − Xb‖²_ℓ2
The oracle selects the ideal least-squares estimator β*:
β* = argmin_{M ⊆ {1,...,p}} E‖β − β̂_M‖²_ℓ2
Ideal because unachievable. The risk of the ideal estimator obeys
E‖β − β*‖²_ℓ2 ≥ ½ Σ_i min(β_i², σ²)
DS vs. ideal model selection
With large probability, the Dantzig selector obeys
‖β̂ − β‖²_ℓ2 = O(log p) · E‖β − β*‖²_ℓ2
The DS does not have any oracle information, and yet it nearly selects the best model and nearly achieves the ideal MSE!
Extensions to compressible signals
Popular model: power law |β|_(1) ≥ |β|_(2) ≥ ... ≥ |β|_(p), with |β|_(i) ≤ R · i^{−s}
Ideal risk:
Σ_i min(β_i², σ²) = min_{0≤m≤p} m σ² + Σ_{i>m} β_(i)² ≤ min_{0≤m≤p} m σ² + R² m^{−(2s−1)}
The optimal m is the number of parameters we need to estimate, m = |{i : |β_i| > σ}| ⇒ could perhaps mimic the ideal risk if the condition on X is valid with m.
Theorem. Suppose S is such that δ_2S + θ_{S,2S} < 1. Then
E‖β̂ − β‖²_ℓ2 ≤ min_{1≤m≤S} O(log p) · (m σ² + R² m^{−(2s−1)})
Can recover the minimax rate over weak-ℓp balls.
Connections with model selection
Canonical selection procedure:
min ‖y − Xb‖²_ℓ2 + Λ σ² |{i : b_i ≠ 0}|
Λ = 2: C_p (Mallows 73), AIC (Akaike 74)
Λ = log n: BIC (Schwarz 78)
Λ = 2 log p: RIC (Foster and George 94)
NP-hard: practically impossible for p > 50 or 60. An enormous and beautiful theoretical literature, focused on estimating Xβ: Barron, Cover; Foster, George; Donoho, Johnstone; Birgé, Massart; Barron, Birgé, Massart.
In contrast...
The Dantzig selector is practical (computationally feasible).
The Dantzig selector does nearly as well as if one had perfect information about the object of interest.
The setup is fairly general.
Geometric intuition
The constraint ‖X^T(y − Xβ̂)‖_ℓ∞ ≤ λσ, with β feasible, puts β̂ inside the diamond ‖β̂‖_ℓ1 ≤ ‖β‖_ℓ1 and inside the slab ‖X^T X(β − β̂)‖_ℓ∞ ≤ 2λσ.
The true vector is feasible if (X_j the columns of X) |⟨z, X_j⟩| ≤ λσ for all j, since then
‖X^T(Xβ − Xβ̂)‖_ℓ∞ ≤ ‖X^T(Xβ − y)‖_ℓ∞ + ‖X^T(y − Xβ̂)‖_ℓ∞ ≤ 2λσ
Practical performance, I
p = 256, n = 72. X is a Gaussian matrix, X_ij i.i.d. N(0, 1/n). σ = .11 and λ = 3.5, so that the threshold is at δ = λσ = .39.
[Figure: true value vs. estimate for the Dantzig selector (left) and the Gauss-Dantzig selector (right)]
Connections with soft-thresholding
Suppose X is an orthogonal matrix. Dantzig selection minimizes Σ_i |b_i| s.t. |(X^T y)_i − b_i| ≤ λ_p σ.
Implication: DS = soft-thresholding of ỹ = X^T y at level λ_p σ:
β̂_i = ỹ_i − λ_p σ if ỹ_i > λ_p σ; 0 if |ỹ_i| ≤ λ_p σ; ỹ_i + λ_p σ if ỹ_i < −λ_p σ
Explains the soft-thresholding-like behavior.
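This closed form is one line of code. A sketch that applies the soft threshold to ỹ = X^T y for a random orthogonal design (sizes and signal are illustrative) and checks feasibility for the DS constraints:

```python
# For orthogonal X the Dantzig selector is soft-thresholding of
# y_tilde = X^T y at level lambda_p * sigma.
import numpy as np

rng = np.random.default_rng(4)
p, sigma = 64, 0.5
lam = np.sqrt(2 * np.log(p))               # lambda_p
X, _ = np.linalg.qr(rng.standard_normal((p, p)))   # orthogonal design

beta = np.zeros(p)
beta[:5] = [4, -3, 2.5, 1.5, -2]
y = X @ beta + sigma * rng.standard_normal(p)

y_tilde = X.T @ y
b_hat = np.sign(y_tilde) * np.maximum(np.abs(y_tilde) - lam * sigma, 0.0)

# Feasibility check: since X^T X = I, X^T (y - X b_hat) = y_tilde - b_hat,
# and soft-thresholding makes every entry at most lam * sigma in magnitude.
resid_corr = float(np.max(np.abs(y_tilde - b_hat)))
```

Coordinates of ỹ below the threshold are set exactly to zero, which is the source of the estimation bias the Gauss-Dantzig refit corrects.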
The Gauss-Dantzig selector
A two-stage procedure for bias correction:
Stage 1: Use the Dantzig selector to estimate the model M = {i : β_i ≠ 0} with M̂ = {i : |β̂_i| > t·σ}, t ≥ 0.
Stage 2: Construct a new estimator by regressing the data y onto the model M̂, i.e. the least squares estimator β̂ = argmin_{b ∈ V_M̂} ‖y − Xb‖²_ℓ2.
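A sketch of the two-stage procedure. For simplicity, Stage 1 below uses plain correlation thresholding as a stand-in for an actual Dantzig-selector solve (an assumption for illustration; any first-stage support estimate would plug in the same way); Stage 2 is the least-squares refit on the selected model. Sizes and thresholds are illustrative.

```python
# Two-stage Gauss-Dantzig sketch: threshold a first-stage estimate to
# pick a model M_hat, then refit by least squares on those columns.
import numpy as np

rng = np.random.default_rng(5)
n, p, S, sigma = 200, 400, 3, 0.05
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)             # unit-normed columns
beta = np.zeros(p)
support = rng.choice(p, S, replace=False)
beta[support] = 3.0
y = X @ beta + sigma * rng.standard_normal(n)

# Stage 1: crude support estimate (stand-in for the DS solution).
b_stage1 = X.T @ y
M_hat = np.flatnonzero(np.abs(b_stage1) > 1.5)

# Stage 2: least-squares refit restricted to the selected columns.
b_hat = np.zeros(p)
b_hat[M_hat] = np.linalg.lstsq(X[:, M_hat], y, rcond=None)[0]
```

The refit removes the shrinkage bias on the selected coordinates: the residual is exactly orthogonal to the chosen columns.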
Practical performance, II
p = 5,000, n = 1,000. X is binary, X_ij = ±1/√n w.p. 1/2. The nonzero entries of β are i.i.d. Cauchy. σ = .5 and λσ = 2.09.
Ratio Σ_i (β̂_i − β_i)² / Σ_i min(β_i², σ²):
S: 5 | 10 | 20 | 50 | 100 | 150 | 200
Gauss-Dantzig selector: 3.70 | 4.52 | 2.78 | 3.52 | 4.09 | 6.56 | 5.11
The constants are quite small!
[Figure: estimates over all 5,000 coordinates (left) and the first 500 coordinates (right)]
Dantzig selector in image processing, I
[Figure: test image and recovered coefficient profiles]
[Figure: original (left) vs. reconstruction (right), n = .043p]
Significance for experimental design Want to sense a high-dimensional parameter vector Only a few coordinates are significant How many measurements do you need?
What should we measure?
Want the conditions for estimation to hold for large values of S:
X_ij ~ N(0, 1): S ≍ n / log(p/n) (Szarek)
X_ij = ±1 w.p. 1/2: S ≍ n / log(p/n) (Pajor et al.)
Randomized incomplete Fourier matrices; many others (tomorrow's talk)
Random designs are good sensing matrices.
Applications: signal acquisition
Analog-to-digital converters, receivers, ... Space: cameras, medical imaging devices, ...
Nyquist/Shannon foundation: must sample at twice the highest frequency. Signals are wider and wider band. Hardware brick wall: extremely fast/high-resolution samplers are decades away.
Subsampling
Signal model: s(t) = (1/√n) Σ_{k=1}^p β_k e^{i2πω_k t}
Observe the signal at n subsampled locations: y_j = s(t_j) + σ z_j, 1 ≤ j ≤ n
Model: y = Xβ + σz, with X_{j,k} = (1/√n) e^{i2πω_k t_j}
X obeys the conditions of the theorem for large S; practically, S ≈ n/4.
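The sensing matrix above is straightforward to construct. A sketch (sizes illustrative) that builds X from random time samples on a frequency grid and checks that its columns are unit-normed and mutually incoherent:

```python
# Subsampled Fourier sensing model: X[j, k] = exp(i 2 pi omega_k t_j) / sqrt(n)
# with n random sample times t_j drawn from a length-p grid.
import numpy as np

rng = np.random.default_rng(6)
n, p = 110, 1000
omega = np.arange(p) / p                         # frequency grid
t = np.sort(rng.choice(p, n, replace=False))     # random sample times
X = np.exp(2j * np.pi * np.outer(t, omega)) / np.sqrt(n)

col_norms = np.linalg.norm(X, axis=0)            # exactly 1 for every column
# Worst-case correlation between distinct columns (coherence proxy).
coherence = float(np.max(np.abs(np.conj(X.T) @ X - np.eye(p))))
```

Randomly subsampled Fourier columns are far from orthogonal pairwise-exact, but their correlations are uniformly small, which is what the uniform uncertainty principle asks for.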
Sampling a spectrally sparse signal
Length p = 1000 vector, with S = 20 active Fourier modes.
[Figure: s(t) (left) and |β(ω)| (right)]
Measurements: sample at n (non-uniformly spaced) locations in time, y_j = s(t_j) + σ z_j, z ~ N(0, I_n). Recover using the Gauss-Dantzig selector.
Sampling a spectrally sparse signal
Ideal recovery error: E‖s − ŝ‖²_ℓ2 ≈ E‖β̂_ideal − β‖²_ℓ2 ≈ S σ²
For the DS, we observe (across a broad range of σ and n): E‖β̂_DS − β‖²_ℓ2 ≈ 1.15 S σ². The price of unknown support is 15% in additional error.
Example for n = 110:
σ²       | Sσ²      | E‖β̂_DS − β‖²_ℓ2
1.1×10⁻⁴ | 1.9×10⁻² | 2.3×10⁻²
6.6×10⁻⁶ | 1.2×10⁻³ | 1.4×10⁻³
4.1×10⁻⁷ | 7.5×10⁻⁵ | 8.6×10⁻⁵
2.6×10⁻⁸ | 4.7×10⁻⁶ | 5.4×10⁻⁶
This has implications for the so-called Walden curve for Shannon-based high-precision ADCs.
Testbed
Detection of a small tone from ℓ1 projection. Small-tone detection is limited by clumping of error energy in sparse bins.
[Figure: measured recovered signal, normalized amplitude (dB) vs. frequency (MHz); SNR = 67.63 dB, SFDR = 79.47 dB]
Summary
Accurate estimation from few observations is possible. Need to solve an LP. Need an incoherent design.
Opportunities: biomedical imagery, new A/D devices.
Part of a larger body of work with many applications: sparse decompositions in overcomplete dictionaries, error correction (decoding by LP), information theory (universal encoders).
Is this magic? Download code at http://www.l1-magic.org
New paradigm for analog to digital
Direct sampling: reproduction of the light field; analog/digital photography, mid-19th century
Indirect sampling: data acquisition in a transformed domain, second half of the 20th century; e.g. CT, MRI
Compressive sampling: data acquisition in an incoherent domain. Take a comparatively small number of general measurements rather than the usual pixels.
Impact: design incoherent analog sensors, e.g. optical. Pay-off: far fewer sensors than what is usually considered necessary.
Rice compressed sensing camera Richard Baraniuk, Kevin Kelly, Yehia Massoud, Don Johnson dsp.rice.edu/cs Other works: Coifman et al., Brady.
Dantzig selector in image processing, II
N × N image: p = N². Data y = Xβ + z where
(Xβ)_k = Σ_{t1,t2} β(t1, t2) cos(2π(k₁t₁ + k₂t₂)/N), k = (k₁, k₂)
Dantzig selection with residual vector r = y − Xb:
min ‖b‖_TV subject to |(X^T r)_i| ≤ λ_i σ
NIST: imaging fuel cells
Look inside fuel cells as they are operating, via neutron imaging. Accelerate the process by limiting the number of projections. Each projection = samples along radial lines in the Fourier domain.
Given measurements y, apply the Dantzig selector
min ‖b‖_TV subject to ‖X^T(y − Xb)‖_ℓ∞ ≤ λσ
where X = P_Ω = partial pseudo-polar FFT (see Averbuch et al. (2001)).
Imaging fuel cells Reconstruction from 20 projections: original backprojection Dantzig selector
Imaging fuel cells
From actual measurements with constant SNR. Backprojection from 720 slices (every 0.25 degrees); Dantzig from 60 slices (every 3 degrees).
Results are very preliminary; could improve with more sophisticated models.