The Dantzig Selector


The Dantzig Selector Emmanuel Candès California Institute of Technology Statistics Colloquium, Stanford University, November 2006 Collaborator: Terence Tao (UCLA)

Statistical linear model
y = Xβ + z
- y ∈ R^n: vector of observations (available to the statistician)
- X ∈ R^{n×p}: data matrix (known)
- β ∈ R^p: object of interest (vector of parameters)
- z ∈ R^n: stochastic error; here, z_i i.i.d. N(0, σ²)
Goal: estimate β from y (not Xβ).

Classical setup
- Number of parameters p fixed
- Number of observations n very large
- Asymptotics with n → ∞

Modern setup
- About as many parameters as observations: n ≈ p
- Many more parameters than observations: n < p or n ≪ p
Very important subject in statistics today: high-dimensional data (conferences, 100s of articles).

Example of interest: MRI angiography (sketch)
- β: N × N image of interest ⇒ p = N²
- y: noisy Fourier coefficients of the image, sampled along 22 radial lines
- 4% coverage for a 512 × 512 image: n = .04p
- 2% coverage for a 1024 × 1024 image: n = .02p

Examples of the modern setup
- Biomedical imaging: fewer measurements than pixels
- Analog-to-digital conversion (ADCs)
- Inverse problems
- Genomics: low number of observations (in the tens); total number of genes assayed (and considered as possible regressors) easily in the thousands
- Curve/surface estimation: recovery of a continuous-time curve from a finite number of noisy samples

Underdetermined systems
Fundamental difficulty: fewer equations than unknowns.
Noiseless setup, p > n:
y = Xβ
Seems highly problematic.

Sparsity as a way out?
What if β is sparse or nearly sparse? Perhaps then we effectively have more observations than unknowns.
Warning. E.g. p = 2n = 20 and X = [I 0] (each row of X has a single 1), so that
y_1 = β_1, y_2 = β_2, ..., y_10 = β_10.
Is it just smoke? What about β_11, β_12, ..., β_20?

Encouraging example, I
Reconstruct by solving
min_b ‖b‖_ℓ1 := Σ_i |b_i|   s.t.   Xb = y = Xβ
β has 15 nonzeros, n = 30 Fourier samples: perfect recovery.

Encouraging example, II
Reconstruct β with
min_b ‖b‖_TV   s.t.   Xb = y = Xβ
where ‖b‖_TV is the total variation of the image b, summed over pixels (i1, i2).
[Figure: original Logan-Shepp phantom; naive reconstruction (filtered backprojection); min-TV reconstruction with nonnegativity constraint: perfect recovery.]

A theory behind these results
Consider the underdetermined system y = Xβ where β is S-sparse (only S of the coordinates of β are nonzero).
Theorem
β̂ = argmin_{b ∈ R^p} ‖b‖_ℓ1   subject to   Xb = y
Perfect reconstruction β̂ = β, provided X obeys a uniform uncertainty principle. (Tomorrow's talk.)
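To make the noiseless recovery program concrete, here is a minimal sketch (not from the talk) that solves min ‖b‖_ℓ1 s.t. Xb = y via the standard LP lift, using scipy.optimize.linprog; the function name basis_pursuit and the dense formulation are my own illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(X, y):
    """Solve min ||b||_1 subject to X b = y via the LP lift in variables (u, b)."""
    n, p = X.shape
    c = np.concatenate([np.ones(p), np.zeros(p)])          # minimize sum(u)
    # -u <= b <= u  <=>  b - u <= 0  and  -b - u <= 0
    A_ub = np.block([[-np.eye(p),  np.eye(p)],
                     [-np.eye(p), -np.eye(p)]])
    b_ub = np.zeros(2 * p)
    A_eq = np.hstack([np.zeros((n, p)), X])                 # X b = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * p + [(None, None)] * p, method="highs")
    return res.x[p:]
```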

Statistical estimation
In real applications, data are corrupted.
Sensing model: y = Xβ + z
Hopeless? (Most of the singular values of X are zero.)

The Dantzig selector
Assume the columns of X are unit-normed.
y = Xβ + z,   z_i i.i.d. N(0, σ²)
Dantzig selector (DS)
β̂ = argmin ‖b‖_ℓ1   s.t.   |(X^T r)_i| ≤ λ_p σ for all i
where r is the vector of residuals, r = y − Xb, and λ_p = √(2 log p) (makes the true β feasible with large probability).

Why ℓ1?
- Theory of the noiseless case
- Precedents in statistics

Why X^T r small (and not r)?
Constraint: ‖X^T r‖_ℓ∞ ≤ λ σ
Invariance: imagine rotating the data y = Xβ + z by an orthogonal U: Uy = (UX)β + Uz, i.e. ỹ = X̃β + z̃. The estimator should not change, and the DS is invariant.
Correlations: X^T r measures the correlation of the residuals with the predictors. Don't leave the j-th predictor out when |⟨r, X_j⟩| is large! It is not the size of r that matters but that of X^T r; e.g. y = X_j and σ > ‖X_j‖_ℓ∞.

The Dantzig selector as a linear program
min Σ_i |b_i|   s.t.   ‖X^T(y − Xb)‖_ℓ∞ ≤ λ_p σ
can be recast as a linear program (LP):
min Σ_i u_i   s.t.   −u ≤ b ≤ u,   −λ_p σ 1 ≤ X^T(y − Xb) ≤ λ_p σ 1,   u, b ∈ R^p
(a ≤ b means a_i ≤ b_i for all i).
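A minimal sketch of this LP recast (my own code, not from the talk), again with scipy.optimize.linprog; the helper name dantzig_selector and the dense matrices are illustrative assumptions and would not scale to large p.

```python
import numpy as np
from scipy.optimize import linprog


def dantzig_selector(X, y, lam_sigma):
    """Solve min ||b||_1 s.t. ||X^T (y - X b)||_inf <= lam_sigma as an LP in (u, b)."""
    n, p = X.shape
    G, Xty = X.T @ X, X.T @ y
    c = np.concatenate([np.ones(p), np.zeros(p)])           # minimize sum(u)
    A_ub = np.block([
        [-np.eye(p),          np.eye(p)],                   #  b - u <= 0
        [-np.eye(p),         -np.eye(p)],                   # -b - u <= 0
        [np.zeros((p, p)),   -G],                           #  X^T y - G b <= lam_sigma
        [np.zeros((p, p)),    G],                           #  G b - X^T y <= lam_sigma
    ])
    b_ub = np.concatenate([np.zeros(2 * p), lam_sigma - Xty, lam_sigma + Xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * p + [(None, None)] * p, method="highs")
    return res.x[p:]
```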

Related approaches
Lasso (Tibshirani, '96)
min ‖y − Xb‖²_ℓ2   s.t.   ‖b‖_ℓ1 ≤ t
Basis pursuit denoising (Chen, Donoho, Saunders, '96)
min ½ ‖y − Xb‖²_ℓ2 + λ_p σ ‖b‖_ℓ1
The Dantzig selector is different.

The Dantzig selector works!
Assume β is S-sparse and X satisfies some condition (δ_2S + θ_{S,2S} < 1). Then with large probability
‖β̂ − β‖²_ℓ2 ≤ O(2 log p) · S · σ²
No blow up!
- σ²: nicely degrades as the noise level increases
- S: the MSE is proportional to the true (unknown) number of parameters (information content) ⇒ adaptivity
- 2 log p: price to pay for not knowing where the significant coordinates are

Understanding the condition: restricted isometries
X_M ∈ R^{n×|M|}: columns of X with indices in M.
Restricted isometry constant δ_S: for each S, δ_S is the smallest number such that
(1 − δ_S) Id ⪯ X_M^T X_M ⪯ (1 + δ_S) Id   for all M with |M| ≤ S.
Sparse subsets of column vectors are approximately orthonormal.
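The definition can be checked by brute force on tiny examples. The sketch below (mine, not from the talk) enumerates all column subsets of size at most S and returns the empirical δ_S; it is exponential in p and only meant to illustrate the definition.

```python
import numpy as np
from itertools import combinations


def restricted_isometry_constant(X, S):
    """Smallest delta with (1 - delta) I <= X_M^T X_M <= (1 + delta) I for all |M| <= S."""
    _, p = X.shape
    delta = 0.0
    for size in range(1, S + 1):
        for M in combinations(range(p), size):
            cols = list(M)
            eig = np.linalg.eigvalsh(X[:, cols].T @ X[:, cols])
            delta = max(delta, 1.0 - eig.min(), eig.max() - 1.0)
    return delta
```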

Understanding the condition: the number θ
[Figure: two subspaces spanned by disjoint column subsets M and M', with angle φ; θ = cos φ.]
θ_{S,S'} is the smallest quantity for which
|⟨Xb, Xb'⟩| ≤ θ_{S,S'} ‖b‖_ℓ2 ‖b'‖_ℓ2
for all b, b' supported on disjoint subsets M, M' ⊂ {1, ..., p} with |M| ≤ S, |M'| ≤ S'.

The condition is almost necessary (Uniform Uncertainty Principle)
To estimate an S-sparse vector β we need X such that δ_2S + θ_{S,2S} < 1.
With δ_2S = 1, one can have a 2S-sparse h ∈ R^p with Xh = 0. Decompose h = β − β', each S-sparse, with X(β − β') = 0, i.e. Xβ = Xβ' ⇒ the model is not identifiable!

Comparing with an oracle
An oracle informs us of the support M of the unknown β.
Ideal LS estimate (ideal because unachievable):
β̂_LS^I = argmin_b ‖y − X_M b‖²_ℓ2,   i.e.   β̂_LS^I = (X_M^T X_M)^{-1} X_M^T y on M
MSE: E‖β̂_LS^I − β‖²_ℓ2 = Tr(X_M^T X_M)^{-1} σ² ≈ S σ²
With large probability, the DS obeys
‖β̂ − β‖²_ℓ2 = O(log p) · E‖β − β̂_LS^I‖²_ℓ2

Is this good enough? A more powerful oracle
The oracle lets us observe all the coordinates of the unknown vector β directly:
y_i = β_i + z_i,   z_i i.i.d. N(0, σ²)
and tells us which coordinates are above the noise level.
Ideal shrinkage estimator (Donoho and Johnstone, '94)
β̂_i^I = 0 if |β_i| < σ,   β̂_i^I = y_i if |β_i| ≥ σ
MSE of the ideal estimator:
E‖β̂^I − β‖²_ℓ2 = Σ_i E(β̂_i^I − β_i)² = Σ_i min(β_i², σ²)
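A tiny sketch of the oracle keep-or-kill rule and its risk Σ_i min(β_i², σ²) (my illustration; it is unachievable in practice since it uses the unknown β):

```python
import numpy as np


def ideal_shrinkage(y, beta, sigma):
    """Oracle keep-or-kill: keep y_i when |beta_i| >= sigma, set it to 0 otherwise."""
    return np.where(np.abs(beta) >= sigma, y, 0.0)


def ideal_risk(beta, sigma):
    """E||beta_hat_I - beta||^2 = sum_i min(beta_i^2, sigma^2)."""
    return float(np.minimum(beta ** 2, sigma ** 2).sum())
```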

Main claim: oracle inequality
Same assumptions as before. Then with large probability
‖β̂ − β‖²_ℓ2 ≤ O(2 log p) · (σ² + Σ_i min(β_i², σ²)).
The DS does not have any oracle information: fewer observations than variables, and direct observations of β are not available. Yet it nearly achieves the ideal MSE! An unexpected feat.
This is optimal: no estimator can essentially do better. Extensions to compressible signals. Other work: Haupt and Nowak ('05).

Another oracle-style benchmark: ideal model selection
For a subset M ⊂ {1, ..., p}, let V_M := {b ∈ R^p : b_i = 0 for all i ∉ M} be its span, and let
β̂_M = argmin_{b ∈ V_M} ‖y − Xb‖²_ℓ2
be the least squares estimate. The oracle selects the ideal least-squares estimator β* (ideal because unachievable):
β* = β̂_{M*},   M* = argmin_{M ⊂ {1,...,p}} E‖β − β̂_M‖²_ℓ2
The risk of the ideal estimator obeys
E‖β − β*‖²_ℓ2 ≥ ½ Σ_i min(β_i², σ²)

DS vs. ideal model selection
With large probability, the Dantzig selector obeys
‖β̂ − β‖²_ℓ2 = O(log p) · E‖β − β*‖²_ℓ2
The DS does not have any oracle information, and yet it nearly selects the best model and nearly achieves the ideal MSE!

Extensions to compressible signals
Popular model: power-law
|β|_(1) ≥ |β|_(2) ≥ ... ≥ |β|_(p),   |β|_(i) ≤ R · i^{−s}
Ideal risk
Σ_i min(β_i², σ²) = min_{0 ≤ m ≤ p} (m σ² + Σ_{i>m} β_(i)²) ≲ min_{0 ≤ m ≤ p} (m σ² + R² m^{−2s+1})
The optimal m is the number of parameters we need to estimate: m* = |{i : |β_i| > σ}|
⇒ could perhaps mimic the ideal risk if the condition on X is valid with m*.
Theorem. Take S such that δ_2S + θ_{S,2S} < 1. Then
E‖β̂ − β‖²_ℓ2 ≤ min_{1 ≤ m ≤ S} O(log p) · (m σ² + R² m^{−2s+1})
Can recover the minimax rate over weak-ℓp balls.
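For a concrete feel of the ideal risk on a compressible vector, here is a small sketch (mine, not from the talk) that evaluates min_{0≤m≤p}(m σ² + Σ_{i>m} β_(i)²), which coincides with Σ_i min(β_i², σ²):

```python
import numpy as np


def compressible_ideal_risk(beta, sigma):
    """min over m of (m * sigma^2 + sum of the smallest p - m squared entries)."""
    b2 = np.sort(beta ** 2)[::-1]                      # beta_(1)^2 >= beta_(2)^2 >= ...
    tails = np.concatenate([[b2.sum()], b2.sum() - np.cumsum(b2)])  # tail sums, m = 0..p
    m = np.arange(len(beta) + 1)
    return float(np.min(m * sigma ** 2 + tails))
```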

Connections with model selection
Canonical selection procedure:
min ‖y − Xb‖²_ℓ2 + Λ σ² |{i : b_i ≠ 0}|
- Λ = 2: C_p (Mallows '73), AIC (Akaike '74)
- Λ = log n: BIC (Schwarz '78)
- Λ = 2 log p: RIC (Foster and George '94)
This is NP-hard: practically impossible for p > 50-60 (see the brute-force sketch below). There is an enormous and beautiful theoretical literature, focused on estimating Xβ: Barron, Cover; Foster, George; Donoho, Johnstone; Birgé, Massart; Barron, Birgé, Massart.
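To see why the canonical ℓ0 procedure is combinatorial, here is a brute-force sketch (mine, purely illustrative): it scans every support up to a given size, so its cost grows exponentially in p.

```python
import numpy as np
from itertools import combinations


def canonical_selection(X, y, Lambda, sigma, max_size):
    """Brute force over supports for min ||y - X b||^2 + Lambda * sigma^2 * |support|."""
    _, p = X.shape
    best_val, best_beta = float(np.sum(y ** 2)), np.zeros(p)     # empty model
    for size in range(1, max_size + 1):
        for M in combinations(range(p), size):
            cols = list(M)
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            val = np.sum((y - X[:, cols] @ coef) ** 2) + Lambda * sigma ** 2 * size
            if val < best_val:
                best_val, best_beta = val, np.zeros(p)
                best_beta[cols] = coef
    return best_beta
```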

In contrast...
- The Dantzig selector is practical (computationally feasible).
- The Dantzig selector does nearly as well as if one had perfect information about the object of interest.
- The setup is fairly general.

Geometric intuition
Constraint: ‖X^T(y − Xβ̂)‖_ℓ∞ ≤ λ σ
- β is feasible ⇒ β̂ lies inside the diamond: ‖β̂‖_ℓ1 ≤ ‖β‖_ℓ1
- β is feasible ⇒ β̂ lies inside the slab: ‖X^T X(β − β̂)‖_ℓ∞ ≤ 2λ σ
[Figure: the ℓ1 ball {b : ‖b‖_ℓ1 = ‖β‖_ℓ1} intersected with the slab around β.]
The true vector is feasible if (X_j denoting the columns of X) |⟨z, X_j⟩| ≤ λ σ for all j, since then
‖X^T(Xβ − Xβ̂)‖_ℓ∞ ≤ ‖X^T(Xβ − y)‖_ℓ∞ + ‖X^T(y − Xβ̂)‖_ℓ∞ ≤ 2λ σ

Practical performance, I
p = 256, n = 72. X is a Gaussian matrix, X_ij i.i.d. N(0, 1/n). σ = .11 and λ = 3.5, so that the threshold is at δ = λ σ = .39.
[Figure: true values vs. estimates for the Dantzig selector (left) and the Gauss-Dantzig selector (right).]

Connections with soft-thresholding
Suppose X is an orthogonal matrix. The Dantzig selector minimizes
Σ_i |b_i|   s.t.   |(X^T y)_i − b_i| ≤ λ_p σ
Implications: the DS is soft-thresholding of ỹ = X^T y at level λ_p σ:
β̂_i = ỹ_i − λ_p σ if ỹ_i > λ_p σ;   β̂_i = 0 if |ỹ_i| ≤ λ_p σ;   β̂_i = ỹ_i + λ_p σ if ỹ_i < −λ_p σ.
This explains the soft-thresholding-like behavior.
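In the orthogonal case the whole estimator reduces to one line; a minimal sketch (mine):

```python
import numpy as np


def soft_threshold(y_tilde, thresh):
    """Soft-thresholding of y_tilde = X^T y at level thresh = lambda_p * sigma:
    the Dantzig selector when X is orthogonal."""
    return np.sign(y_tilde) * np.maximum(np.abs(y_tilde) - thresh, 0.0)
```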

The Gauss-Dantzig selector
Two-stage procedure for bias correction:
Stage 1: estimate M = {i : β_i ≠ 0} with M̂ = {i : |β̂_i| > t σ}, t ≥ 0.
Stage 2: construct the least squares estimator β̂ = argmin_{b ∈ V_M̂} ‖y − Xb‖²_ℓ2.
That is, use the Dantzig selector to estimate the model M, then construct a new estimator by regressing the data y onto the model M̂.
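A minimal sketch of the two-stage refit (mine; beta_ds would come from any Dantzig-selector solver, e.g. the LP sketch earlier, and t is the user-chosen threshold):

```python
import numpy as np


def gauss_dantzig(X, y, beta_ds, sigma, t=1.0):
    """Keep coordinates with |beta_ds_i| > t * sigma, then refit by least squares."""
    support = np.flatnonzero(np.abs(beta_ds) > t * sigma)
    beta = np.zeros(X.shape[1])
    if support.size:
        beta[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    return beta
```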

Practical performance, II
p = 5,000, n = 1,000. X is binary, X_ij = ±1/√n w.p. 1/2. The nonzero entries of β are i.i.d. Cauchy. σ = .5 and λ σ = 2.09.
Ratio Σ_i (β̂_i − β_i)² / Σ_i min(β_i², σ²):

S                       5     10    20    50    100   150   200
Gauss-Dantzig selector  3.70  4.52  2.78  3.52  4.09  6.56  5.11

The constants are quite small!

[Figure: estimated coefficients; all coordinates (left) and first 500 coordinates (right).]

Dantzig selector in image processing, I
[Figure.]

[Figure: original vs. reconstruction, n = .043p.]

Significance for experimental design
- Want to sense a high-dimensional parameter vector
- Only a few coordinates are significant
- How many measurements do you need?

What should we measure?
We want the conditions for estimation to hold for large values of S.
- X_ij ~ N(0, 1): S ≍ n / log(p/n) (Szarek)
- X_ij = ±1 w.p. 1/2: S ≍ n / log(p/n) (Pajor et al.)
- Randomized incomplete Fourier matrices
- Many others (tomorrow's talk)
Random designs are good sensing matrices.

Applications: signal acquisition
- Analog-to-digital converters, receivers, ...
- Space: cameras, medical imaging devices, ...
- Nyquist/Shannon foundation: must sample at twice the highest frequency
- Signals are wider and wider band
- Hardware brickwall: extremely fast/high-resolution samplers are decades away

Subsampling
Signal model:
s(t) = (1/√n) Σ_{k=1}^p β_k e^{i 2π ω_k t}
Observe the signal at n subsampled locations:
y_j = s(t_j) + σ z_j,   1 ≤ j ≤ n
Model: y = Xβ + σz with X_{j,k} = (1/√n) e^{i 2π ω_k t_j}.
X obeys the conditions of the theorem for large S; practically, S ≈ n/4.
[Figure: a signal and its n subsampled time locations.]
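A sketch of the sensing matrix in this model (mine, not from the talk; the particular choice of random times in [0, 1) and integer frequencies is only for illustration):

```python
import numpy as np


def subsampled_fourier_matrix(t, omega):
    """X[j, k] = exp(i 2 pi omega_k t_j) / sqrt(n), with n = number of time samples."""
    return np.exp(2j * np.pi * np.outer(t, omega)) / np.sqrt(len(t))


# e.g. n = 50 random times in [0, 1), p = 1000 integer frequencies
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, size=50))
omega = np.arange(1000)
X = subsampled_fourier_matrix(t, omega)
```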

Sampling a spectrally sparse signal
Length p = 1000 vector, with S = 20 active Fourier modes.
[Figure: s(t) (left) and β(ω) (right).]
Measurements: sample at n non-uniformly spaced locations in time,
y_j = s(t_j) + σ z_j,   z ~ N(0, I_n)
Recover using the Gauss-Dantzig selector.

Sampling a spectrally sparse signal
Ideal recovery error:
E‖s − ŝ‖²_ℓ2 ≈ E‖β̂_ideal − β‖²_ℓ2 ≈ S σ²
For the DS, we observe (across a broad range of σ and n)
E‖β̂_DS − β‖²_ℓ2 ≈ 1.15 S σ²
The price of the unknown support is 15% in additional error. Example for n = 110:

σ²          S σ²        E‖β̂_DS − β‖²_2
1.1 × 10⁻⁴  1.9 × 10⁻²  2.3 × 10⁻²
6.6 × 10⁻⁶  1.2 × 10⁻³  1.4 × 10⁻³
4.1 × 10⁻⁷  7.5 × 10⁻⁵  8.6 × 10⁻⁵
2.6 × 10⁻⁸  4.7 × 10⁻⁶  5.4 × 10⁻⁶

This has implications for the so-called Walden curve for Shannon-based high-precision ADCs.

Testbed: detection of a small tone from an ℓ1 projection
Small-tone detection is limited by clumping of error energy in sparse bins.
[Figure: measured recovered signal spectrum, normalized amplitude (dB) vs. frequency (MHz), labeled OCT9-Uncorrected; annotations at 70 dB and 80.8 dB; SNR = 67.63 dB, SFDR = 79.47 dB, Loading = PClip - 4.74 dB.]

Summary
- Accurate estimation from few observations is possible: need to solve an LP, and need an incoherent design.
- Opportunities: biomedical imagery, new A/D devices.
- Part of a larger body of work with many applications: sparse decompositions in overcomplete dictionaries, error correction (decoding by LP), information theory (universal encoders).

Is this magic? Download code at http://www.l1-magic.org

New paradigm for analog to digital
- Direct sampling: reproduction of the light field; analog/digital photography, mid 19th century.
- Indirect sampling: data acquisition in a transformed domain, second half of the 20th century; e.g. CT, MRI.
- Compressive sampling: data acquisition in an incoherent domain. Take a comparably small number of general measurements rather than the usual pixels.
Impact: design incoherent analog sensors, e.g. optical. Pay-off: far fewer sensors than what is usually considered necessary.

Rice compressed sensing camera Richard Baraniuk, Kevin Kelly, Yehia Massoud, Don Johnson dsp.rice.edu/cs Other works: Coifman et al., Brady.

Dantzig selector in image processing, II
N by N image: p = N².
Data y = Xβ + z where
(Xβ)_k = Σ_{t1,t2} β(t1, t2) cos(2π(k1 t1 + k2 t2)/N),   k = (k1, k2)
Dantzig selection with residual vector r = y − Xb:
min ‖b‖_TV   subject to   |(X^T r)_i| ≤ λ σ for all i
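A direct (and slow) sketch of this measurement operator (mine, not from the talk), evaluating (Xβ)_k for a list of frequency pairs k = (k1, k2):

```python
import numpy as np


def cosine_measurements(beta, freqs):
    """(X beta)_k = sum_{t1,t2} beta[t1, t2] * cos(2*pi*(k1*t1 + k2*t2)/N)."""
    N = beta.shape[0]
    t1, t2 = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.array([np.sum(beta * np.cos(2 * np.pi * (k1 * t1 + k2 * t2) / N))
                     for k1, k2 in freqs])
```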

NIST: imaging fuel cells
Look inside fuel cells as they are operating via neutron imaging. Accelerate the process by limiting the number of projections; each projection = samples along radial lines in the Fourier domain.
Given measurements y, apply the Dantzig selector
min ‖b‖_TV   subject to   ‖X^T(y − Xb)‖_ℓ∞ ≤ λ σ
where X = P_Ω, a partial pseudo-polar FFT (see Averbuch et al. (2001)).

Imaging fuel cells
Reconstruction from 20 projections:
[Figure: original, backprojection, Dantzig selector.]

Imaging fuel cells
From actual measurements with constant SNR:
- Backprojection from 720 slices, every 0.25 degrees
- Dantzig selector from 60 slices, every 3 degrees
Results are very preliminary; they could improve with more sophisticated models.