Sparsity in Underdetermined Systems
Department of Statistics, Stanford University
August 19, 2005
Classical Linear Regression Problem
> Given predictors $X$ ($n \times p$) and response $y$ ($n \times 1$).
> Linear model: $y = X\beta + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2 I)$.
> Estimate $\beta$ with $\hat\beta = (X'X)^{-1} X'y$.
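A minimal numerical sketch of this estimate (hypothetical data, not from the slides; NumPy's least-squares solver stands in for the explicit normal-equations formula):

```python
import numpy as np

# Hypothetical classical setting: n > p.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ beta + rng.normal(0, 1.0, size=n)

# Classical estimate (X'X)^{-1} X'y, computed with a numerically
# stable least-squares solver rather than an explicit inverse.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```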
Developing Trend
> The classical model requires $p < n$, but recent developments have pushed people beyond the classical model, to $p > n$.
New Data Types
> Microarray data: $p$ is number of genes, $n$ is number of patients.
> Financial data: $p$ is number of stocks, prices, etc.; $n$ is number of time points.
> Data mining: automated data collection can imply large numbers of variables.
> Texture classification in images (e.g. satellite): $p$ is number of pixels, $n$ is number of images.
Estimating the model
> Can we find an estimate for $\beta$ when $p > n$?
> George Box (1986), Effect-Sparsity: the vast majority of factors have zero effect; only a small fraction actually affect the response.
> $y = X\beta + \varepsilon$ can still be modeled, but now $\beta$ must be sparse: a few nonzero elements, the remaining elements zero.
Strategies for Sparse Modeling
1. All Subsets Regression: fit all possible linear models at all levels of sparsity.
2. Forward Stepwise Regression: greedy approach that chooses each variable in the model sequentially by significance level.
3. LASSO (Tibshirani 1994), LARS (Efron, Hastie, Johnstone, Tibshirani 2002): shrink some coefficient estimates exactly to zero.
4. Stochastic Search Variable Selection (SSVS) (George & McCulloch 1993): set a prior that stipulates only a few terms in the model.
LASSO and LARS: a quick tour
> LASSO solves: $\min_\beta \|y - X\beta\|_2^2$ subject to $\|\beta\|_1 \le t$, for a choice of $t$.
> LARS: a stepwise approximation to LASSO. Advantage: guaranteed to stop in $n$ steps.
> Allowing variables to be removed from the current active set gives the LASSO solution.
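A sketch of the LASSO/LARS path on hypothetical data, using scikit-learn's lars_path (a later open-source implementation of the same algorithms; all sizes below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Hypothetical sparse problem: n = 50 observations, p = 100 predictors.
rng = np.random.default_rng(1)
n, p, k = 50, 100, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = rng.uniform(1, 5, size=k)
y = X @ beta + rng.normal(0, 0.1, size=n)

# method='lasso' allows variables to leave the active set, so the path
# coincides with the LASSO solution path; method='lar' is the plain
# stepwise LARS approximation that stops after at most n steps.
alphas, active, coefs = lars_path(X, y, method='lasso')
print("active variables at the end of the path:", sorted(active))
```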
A New Perspective
> Up until now we've described the statistical view of the problem, when $p < n$.
> Now introduce ideas from Signal Processing, and a new tool for understanding regression when $p > n$, in the case of large $p$ and $n$.
> Claim: This will allow us to see that, for certain problems, statistical solutions such as LASSO and LARS are just as good as all subsets regression.
Background from Signal Processing
> There exists a signal $y$, and several ortho-bases (e.g. sinusoids, wavelets, Gabor).
> A concatenation of several ortho-bases is a dictionary.
> Postulate that the signal is sparsely representable, i.e. made up from few components of the dictionary.
> Motivation:
  Image = Texture + Cartoon
  Signal = Sinusoids + Spikes
  Signal = CDMA + TDMA + FM + ...
Overcomplete Dictionaries
> Canonical basis $B_C$: the $n \times n$ identity matrix (for $n = 4$: columns $(1,0,0,0)'$, $(0,1,0,0)'$, $(0,0,1,0)'$, $(0,0,0,1)'$).
> Standard Fourier basis $B_F$: $n$ orthogonal columns $\phi(\omega_k, 0)$, $\phi(\omega_k, 1)$, where $\omega_k = 2\pi k/n$, $k = 0, \dots, n/2$, and $0, 1$ indicates cosine, sine.
> $A = [B_C \; B_F]$ is an $n \times 2n$ overcomplete dictionary.
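A sketch of how such a dictionary could be assembled in code (a hypothetical construction assuming even $n$; the column normalization and the exact indexing of the sine/cosine columns are my assumptions, not the slides'):

```python
import numpy as np

def overcomplete_dictionary(n):
    """Concatenate the canonical basis with a real Fourier basis
    to form an n x 2n overcomplete dictionary A = [B_C  B_F].
    Assumes n is even."""
    B_C = np.eye(n)                      # canonical (spike) basis
    t = np.arange(n)
    cols = []
    for k in range(n // 2 + 1):
        w = 2 * np.pi * k / n            # omega_k = 2 pi k / n
        cols.append(np.cos(w * t))       # phi(omega_k, 0): cosine
        if 0 < k < n // 2:
            cols.append(np.sin(w * t))   # phi(omega_k, 1): sine
    B_F = np.column_stack(cols)
    B_F /= np.linalg.norm(B_F, axis=0)   # unit-norm columns
    return np.hstack([B_C, B_F])

A = overcomplete_dictionary(8)
print(A.shape)  # (8, 16): n x 2n
```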
Example: Image = Wavelets + Ridgelets (Starck 2000)
[Figure: decomposition into a wavelet component and a ridgelet component.]
Example: Image = Texture + Cartoon (Elad and Starck 2003)
[Figure: cartoon component (curvelets) and texture component (local sinusoids).]
Example: Image = Texture + Cartoon (Elad and Starck 2003)
[Figure: cartoon component alongside the original image.]
Formal Signal Processing Problem Description
> Signal decomposition: $y = Ax$.
> With a noise term: $y = Ax + z$, $z \sim N(0, \sigma^2 I)$.
> Correspondence with regression:

                  Signal   Matrix   Coefficients   Noise   Dimensions
  Regression        y        X          beta       eps     n observations, p predictors
  Decomposition     y        A          x          z       signal length n, p columns

> If #bases > 1, then $p/n$ = #bases, so $p > n$.
Signal Processing Solutions
1. Matching Pursuit (Mallat, Zhang 1993): the analogue of Forward Stepwise Regression.
2. Basis Pursuit (Chen, Donoho 1994): a simple global optimization criterion,
   $(P_1)\quad \min_x \|x\|_1 \ \text{s.t.}\ y = Ax$
3. Maximally Sparse Solution: intuitively most compelling, but not feasible!
   $(P_0)\quad \min_x \|x\|_0 \ \text{s.t.}\ y = Ax$
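Basis Pursuit is a linear program in disguise: writing $x = u - v$ with $u, v \ge 0$ gives $\|x\|_1 = \sum (u + v)$. A minimal sketch with SciPy's LP solver (the example data is hypothetical):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve (P1): min ||x||_1 s.t. y = Ax, as a linear program.
    Split x = u - v with u, v >= 0, so ||x||_1 = sum(u + v)."""
    n, p = A.shape
    c = np.ones(2 * p)                    # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])             # [A, -A] @ [u; v] = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method='highs')
    u, v = res.x[:p], res.x[p:]
    return u - v

# Hypothetical underdetermined system with a sparse solution.
rng = np.random.default_rng(2)
n, p, k = 30, 60, 3
A = rng.standard_normal((n, p))
x0 = np.zeros(p)
x0[:k] = [4.0, -2.0, 1.0]
x_hat = basis_pursuit(A, A @ x0)
print(np.round(x_hat[:5], 3))             # recovers the sparse x0
```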
The $\ell_0$ Problem is Impossible!
> We can't hope to do an all-subsets search, but we are lucky: $(P_1)$ is a convex problem, and it can sometimes solve $(P_0)$.
$(\ell_1, \ell_0)$ Equivalence
> Signal processing results show $(P_1)$ solves $(P_0)$ for certain problems:
> Donoho, Huo (IEEE IT, 2001)
> Donoho, Elad (PNAS, 2003)
> Tropp (IEEE IT, 2004)
> Gribonval (IEEE IT, 2004)
> Candès, Romberg, Tao (IEEE IT, to appear)
Phase Transition in Random Matrix Model
> $A$ is $n \times p$ with entries $A_{ij} \sim N(0,1)$.
> $y = Ax$, where $x$ has $k$ nonzeros with random values in random positions.
> Phase plane $(\delta, \rho)$:
  $\rho = k/n$: degree of sparsity
  $\delta = n/p$: degree of underdetermination
> Theorem (DLD 2005). There exists a critical $\rho_W(\delta)$ such that, for the overwhelming majority of $(y, A)$ pairs with $\rho < \rho_W$, $(P_1)$ solves $(P_0)$.
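One point of the phase plane can be probed by direct simulation; a sketch (the grid point, tolerance, and trial count are arbitrary illustrative choices, not values from the slides):

```python
import numpy as np
from scipy.optimize import linprog

def solves_p0(n, p, k, rng):
    """One Monte Carlo trial: does (P1) recover the k-sparse x?"""
    A = rng.standard_normal((n, p))
    x0 = np.zeros(p)
    idx = rng.choice(p, size=k, replace=False)   # random positions
    x0[idx] = rng.standard_normal(k)             # random nonzero values
    y = A @ x0
    # Basis Pursuit as an LP, as in the earlier sketch.
    c = np.ones(2 * p)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method='highs')
    x_hat = res.x[:p] - res.x[p:]
    return np.linalg.norm(x_hat - x0) <= 1e-6 * np.linalg.norm(x0)

# Estimate the success probability at one (delta, rho) grid point.
rng = np.random.default_rng(3)
p, delta, rho = 200, 0.5, 0.2          # delta = n/p, rho = k/n
n, k = int(delta * p), int(rho * delta * p)
successes = sum(solves_p0(n, p, k, rng) for _ in range(20))
print(f"empirical success rate: {successes / 20:.2f}")
```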
Phase Transition: $(\ell_1, \ell_0)$ Equivalence
[Figure: phase plane with $\delta = n/p$ on the horizontal axis and $\rho = k/n$ on the vertical axis; below the transition curve $(P_1)$ solves $(P_0)$, above it combinatorial search is required.]
Paradigm for study
> $P$ is a property of an algorithm,
> $(y, A)$ is a random ensemble,
> Find the phase transitions for property $P$.
> Approach pioneered by Donoho, Drori, and Tsaig.
> Many properties studied, many phase transitions.
Phase Diagram Approach
1. Generate $y = Ax_0$, where $x_0$ is sparse.
2. Run the full solution path.
3. Call the final path element $x_1$.
4. Property $P$: $\|x_1 - x_0\|_2 \le \epsilon \|x_0\|_2$.
The following figures are due to Iddo Drori and Yaakov Tsaig.
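A sketch of one grid cell of this recipe, using scikit-learn's LASSO path as the solver (the tolerance $\epsilon$ and problem sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import lars_path

def has_property_P(n, p, k, rng, eps=1e-4):
    """One cell of the phase diagram: generate y = A x0 with sparse x0,
    run the full LASSO path, and test the final path element x1 against
    the relative-error property ||x1 - x0||_2 <= eps * ||x0||_2."""
    A = rng.standard_normal((n, p))
    A /= np.linalg.norm(A, axis=0)         # unit-norm columns
    x0 = np.zeros(p)
    x0[rng.choice(p, k, replace=False)] = rng.standard_normal(k)
    y = A @ x0
    _, _, coefs = lars_path(A, y, method='lasso', alpha_min=0)
    x1 = coefs[:, -1]                      # final element of the path
    return np.linalg.norm(x1 - x0) <= eps * np.linalg.norm(x0)

rng = np.random.default_rng(4)
print(has_property_P(n=100, p=200, k=15, rng=rng))
```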
Phase Diagram: Noiseless Model
[Figure: empirical phase diagrams for LASSO and LARS over $\rho = k/n$ vs. $\delta = n/p$; $p = 200$, $n$ varies from 10 to 200.]
Phase Diagram: Noiseless Model
[Figure: empirical phase diagram for Forward Stepwise over $\rho = k/n$ vs. $\delta = n/p$; $p = 200$, $n$ varies from 10 to 200.]
Phase Transition Surprises
> Surprise: LASSO, in the noiseless case, reaches the same final point as best subsets, for $\rho < \rho_{LASSO}$.
> Hoped for: LARS, in the noiseless case, same as best subsets, for $\rho < \rho_{LARS}$.
> Surprise: Stepwise, in the noiseless case, is only successful for $\rho$ well below $\rho_{LASSO}$.
This implies a statistics question!
> Could results like these describe linear regression for noisy data?
> For example, when are LASSO, LARS, Forward Stepwise just as good as all subsets regression?
> Reformulate the problems with noise:
  $(P_{0,\lambda})\quad \min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_0$
  $(P_{1,\lambda})\quad \min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
Experiment Setup
> $X$ is $n \times p$, with random entries generated from $N(0,1)$, and normalized columns.
> $\beta$ is a $p$-vector with the first $k$ entries drawn from $U(0,100)$, the remaining entries 0.
> $\varepsilon$ is a random $N(0,16)$ $n$-vector.
> Create $y = X\beta + \varepsilon$.
> We find the solution $\hat\beta$ using an algorithm (LASSO, LARS, Forward Stepwise) with $y$ and $X$ as inputs and the appropriate stopping criterion.
Implementation of LASSO
> Engineering approach to a stopping criterion: stop once $\|y - X\hat\beta\|_2^2 \le \|z\|_2^2$, i.e. once the residual sum of squares reaches the noise level (here $z$ denotes the noise term, $\varepsilon$ above).
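A sketch of the noisy experiment with this stopping rule (scikit-learn's LASSO path stands in for the original solver; thresholding at the true $\|z\|_2^2$ follows the slide's engineering approach, which assumes the noise level is known):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Hypothetical instance of the stated setup: sigma = 4 (variance 16),
# k nonzero effects drawn from U(0, 100).
rng = np.random.default_rng(5)
n, p, k, sigma = 100, 200, 10, 4.0

X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)            # normalized columns
beta = np.zeros(p)
beta[:k] = rng.uniform(0, 100, size=k)    # first k entries from U(0,100)
eps = rng.normal(0, sigma, size=n)        # N(0, 16) noise
y = X @ beta + eps

# Run the LASSO path and stop at the first point where the residual
# sum of squares falls to the noise level ||eps||^2.
_, _, coefs = lars_path(X, y, method='lasso', alpha_min=0)
noise_level = np.sum(eps ** 2)
for j in range(coefs.shape[1]):
    if np.sum((y - X @ coefs[:, j]) ** 2) <= noise_level:
        break
beta_hat = coefs[:, j]
print("nonzeros selected:", np.count_nonzero(beta_hat))
```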
Questions > Will there be any phase transition? > Can we learn something about the properties of these algorithms from the Phase Diagram?
Property of Interest: Risk Ratio
> Ideal Risk: $R_{ideal}(\hat\beta) = \mathrm{tr}\{(X_I' X_I)^{-1}\}\,\sigma^2$, where $I$ is the support of the true $\beta$.
> Actual Risk: $R(\hat\beta) = E\|\hat\beta - \beta\|_2^2$.
> Compare empirical risk to the notion of an oracle inequality (Donoho and Johnstone 1994): $R(\hat\beta) \le c_0 + c_1 R_{ideal}(\hat\beta)$, with $c_0 = \sigma^2$, $c_1 = 2\log(p)$.
> Can plot the risk ratio $R(\hat\beta)/R_{ideal}(\hat\beta)$ in a Phase Diagram.
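A sketch of how the two risks could be computed empirically (reading $X_I$ as the columns of $X$ on the support of the true $\beta$; averaging over Monte Carlo replications is my assumption about the procedure):

```python
import numpy as np

def ideal_risk(X, beta, sigma):
    """Ideal risk tr{(X_I' X_I)^{-1}} * sigma^2, where I indexes the
    support of the true beta (the oracle's variable subset)."""
    I = np.flatnonzero(beta)
    XI = X[:, I]
    return np.trace(np.linalg.inv(XI.T @ XI)) * sigma ** 2

def empirical_risk_ratio(X, beta, beta_hats, sigma):
    """Estimate R(beta_hat) / R_ideal by averaging the squared-error
    losses of fitted coefficient vectors over replications."""
    losses = [np.sum((bh - beta) ** 2) for bh in beta_hats]
    return np.mean(losses) / ideal_risk(X, beta, sigma)
```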
Phase Diagram: LASSO for Noisy Model
[Figure: risk ratio over the phase plane; $p = 200$, $n$ varies from 10 to 200.]
Phase Diagram: LARS for Noisy Model
[Figure: risk ratio over the phase plane; $p = 200$, $n$ varies from 10 to 200.]
Phase Diagram: Stepwise for Noisy Model
[Figure: risk ratio over the phase plane; $p = 200$, $n$ varies from 10 to 200.]
Experiences with Noisy Case
> Phase Diagrams are revealing, stimulating.
> Stepwise Regression falls apart at a critical sparsity level (why?)
> LARS in the same cases works very well!
> Suggests other interesting properties to study.
> Other algorithms: Forward Stagewise, Backward Elimination, SSVS.
Introducing SparseLab! http://www-stat.stanford.edu/~sparselab
> Make software solutions for sparse systems available.
> Growing research on sparsity and variable selection could advance faster if the research community has standard tools.
> SparseLab is a system to do this.
SparseLab in Depth
> Currently included solvers:
  IHT: Iterative Hard Thresholding
  IRWLS: Iteratively Reweighted Least Squares
  IST: Iterative Soft Thresholding
  LARS: Least Angle Regression
  MP: Matching Pursuit
  OMP: Orthogonal Matching Pursuit
  PDCO: Primal-Dual Barrier Method for Convex Objectives
  Simplex
  Forward Stepwise
SparseLab in Depth
> Current examples:
  Compressed Sensing
  Deconvolving mRNA
  Error Correcting Codes
  NMR Spectroscopy
  NonNegative Fourier
  Time Frequency
SparseLab in Depth
> Reproducible Research: SparseLab makes available the code to reproduce figures in published papers.
> Papers currently included:
  Extensions of Compressed Sensing (Tsaig, Donoho 2005)
  Breakdown of Equivalence between the Minimal $\ell_1$-norm Solution and the Sparsest Solution (Tsaig, Donoho 2004)
  High-Dimensional Centrally-Symmetric Polytopes With Neighborliness Proportional to Dimension (Donoho 2005)
> All open source!
Conclusions: Ambitions for Thesis
> SparseLab publicly available:
  containing 5-10 papers,
  intuitive interfaces for both Statistics and Signal Processing.
> Use of the Phase Diagram to investigate behavior of:
  LARS, LASSO, Forward Stepwise, Forward Stagewise, SSVS, Log Penalty,
  in the noisy case.
Acknowledgments Dave Donoho Iddo Drori Joshua Sweetkind-Singer Yaakov Tsaig