Reconstruction from Anisotropic Random Measurements
Mark Rudelson and Shuheng Zhou, The University of Michigan, Ann Arbor
Coding, Complexity, and Sparsity Workshop, Ann Arbor, Michigan, August 7, 2013
Want to estimate a parameter β ∈ R^p
Example: how is a response y ∈ R, such as Parkinson's disease, affected by a set of genes among the Chinese population?
Construct a linear model: y = β^T x + ε, where E(y | x) = β^T x
Parameter: the non-zero entries of β (the sparsity pattern of β) identify a subset of genes and indicate how much they influence y
Take a random sample of (X, Y), and use the sample to estimate β; that is, we have Y = Xβ + ε
Model selection and parameter estimation
When can we approximately recover β from n noisy observations Y?
Questions:
How many measurements n do we need in order to recover the non-zero positions in β?
How does n scale with p or s, where s is the number of non-zero entries of β?
What assumptions about the data matrix X are reasonable?
Sparse recovery
When β is known to be s-sparse for some 1 ≤ s ≤ n, which means that at most s of the coefficients of β can be non-zero:
Assume every 2s columns of X are linearly independent; identifiability condition (reasonable once n ≥ 2s):
Λ_min(2s) = min_{υ≠0, 2s-sparse} ‖Xυ‖₂ / (√n ‖υ‖₂) > 0
Proposition (Candès-Tao 05): Suppose that any 2s columns of the n × p matrix X are linearly independent. Then any s-sparse signal β ∈ R^p can be reconstructed uniquely from Xβ
ℓ₀-minimization
How to reconstruct an s-sparse signal β ∈ R^p from the measurements Y = Xβ, given Λ_min(2s) > 0?
Let β̂ be the unique sparsest solution to Xβ = Y:
β̂ = arg min_{β: Xβ=Y} ‖β‖₀, where ‖β‖₀ := #{1 ≤ i ≤ p : β_i ≠ 0} is the sparsity of β
Unfortunately, ℓ₀-minimization is computationally intractable (in fact, it is an NP-complete problem)
Basis pursuit
Consider the following convex optimization problem:
β̂ := arg min_{β: Xβ=Y} ‖β‖₁
Basis pursuit works whenever the n × p measurement matrix X is sufficiently incoherent: RIP (Candès-Tao 05) requires that for all T ⊆ {1, …, p} with |T| ≤ s and for all coefficient sequences (c_j)_{j∈T},
(1 − δ_s)‖c‖₂² ≤ ‖X_T c‖₂²/n ≤ (1 + δ_s)‖c‖₂²
holds for some 0 < δ_s < 1 (the s-restricted isometry constant)
Good matrices for compressed sensing should satisfy these inequalities for the largest possible s
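As a numerical illustration (not part of the slides), basis pursuit can be solved as a linear program by splitting β into its positive and negative parts, β = u − v with u, v ≥ 0. The sketch below uses a Gaussian measurement matrix and SciPy's `linprog`; all sizes and variable names are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, s = 64, 128, 4           # measurements, ambient dimension, sparsity

# Gaussian measurement matrix (satisfies RIP with high probability)
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta[support] = rng.standard_normal(s)
Y = X @ beta                    # noiseless measurements

# Basis pursuit as an LP: minimize sum(u) + sum(v) = ||beta||_1
# subject to X(u - v) = Y, u >= 0, v >= 0.
c = np.ones(2 * p)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=Y, bounds=[(0, None)] * (2 * p))
beta_hat = res.x[:p] - res.x[p:]

print(np.linalg.norm(beta_hat - beta))  # essentially zero: exact recovery
```

With n = 64 Gaussian measurements and s = 4, the ℓ₁ minimizer coincides with the true s-sparse signal with overwhelming probability.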
Restricted Isometry Property (RIP): examples
For a Gaussian random matrix, or any sub-gaussian ensemble, RIP holds with s ≍ n / log(p/n)
For the random Fourier ensemble, or randomly sampled rows of orthonormal matrices, RIP holds for s = O(n / log⁴ p)
For a random matrix composed of columns that are independent isotropic vectors with log-concave densities, RIP holds for s = O(n / log²(p/n))
References: Candès-Tao 05, 06, Rudelson-Vershynin 05, Donoho 06, Baraniuk et al 08, Mendelson et al 08, Adamczak et al 09
Basis pursuit for high-dimensional data
These algorithms are also robust with respect to noise, and RIP will be replaced by more relaxed conditions
In particular, the isotropy condition, which has been assumed in all of the literature cited above, needs to be dropped
Let X_i ∈ R^p, i = 1, …, n be iid random row vectors of the design matrix X
Covariance matrix: Σ(X_i) = E X_i ⊗ X_i = E X_i X_i^T
Σ̂_n = (1/n) Σ_{i=1}^n X_i ⊗ X_i = (1/n) Σ_{i=1}^n X_i X_i^T
X_i is isotropic if Σ(X_i) = I, in which case E‖X_i‖₂² = p
Sparse recovery for Y = Xβ + ε
Lasso (Tibshirani 96), aka Basis Pursuit (Chen, Donoho and Saunders 98, and others):
β̂ = arg min_β ‖Y − Xβ‖₂²/(2n) + λ_n ‖β‖₁,
where the scaling factor 1/(2n) is chosen for convenience
Dantzig selector (Candès-Tao 07):
(DS) arg min_{β̃ ∈ R^p} ‖β̃‖₁ subject to ‖X^T(Y − Xβ̃)/n‖_∞ ≤ λ_n
References: Greenshtein-Ritov 04, Meinshausen-Bühlmann 06, Zhao-Yu 06, Bunea et al 07, Candès-Tao 07, van de Geer 08, Zhang-Huang 08, Wainwright 09, Koltchinskii 09, Meinshausen-Yu 09, Bickel et al 09, and others
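A minimal numpy-only sketch of the Lasso (not from the talk): the objective ‖Y − Xβ‖₂²/(2n) + λ_n‖β‖₁ can be minimized by proximal gradient descent (ISTA), whose proximal step is componentwise soft thresholding. All names, constants, and the choice of λ_n below are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: componentwise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=2000):
    """Minimize ||Y - X b||_2^2 / (2n) + lam * ||b||_1 by proximal gradient."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2    # 1/L, with L = ||X||^2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (Y - X @ b) / n       # gradient of the quadratic part
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(1)
n, p, s = 200, 500, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = [3, -2, 4, 1.5, -3]
Y = X @ beta + 0.5 * rng.standard_normal(n)

lam = 2 * 0.5 * np.sqrt(2 * np.log(p) / n)  # 2 * sigma * sqrt(2 log p / n)
beta_hat = lasso_ista(X, Y, lam)
print(np.flatnonzero(np.abs(beta_hat) > 0.5))  # estimated support
```

With n = 200 observations, s = 5 strong coefficients, and the standard choice λ_n ≍ σ√(log p / n), the estimated support matches the true one.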
The Cone Constraint
For an appropriately chosen λ_n, the solution of the Lasso or the Dantzig selector satisfies (under iid Gaussian noise), with high probability,
υ := β̂ − β ∈ C(s, k₀),
with k₀ = 1 for the Dantzig selector and k₀ = 3 for the Lasso
Object of interest: for 1 ≤ s₀ ≤ p and a positive number k₀,
C(s₀, k₀) = { x ∈ R^p : ∃ J ⊆ {1, …, p}, |J| = s₀, s.t. ‖x_{J^c}‖₁ ≤ k₀ ‖x_J‖₁ }
This object has appeared in earlier work in the noiseless setting
References: Donoho-Huo 01, Elad-Bruckstein 02, Feuer-Nemirovski 03, Candès-Tao 07, Bickel-Ritov-Tsybakov 09, Cohen-Dahmen-DeVore 09
The Lasso solution
Restricted Eigenvalue (RE) condition
Object of interest:
C(s₀, k₀) = { x ∈ R^p : ∃ J ⊆ {1, …, p}, |J| = s₀, s.t. ‖x_{J^c}‖₁ ≤ k₀ ‖x_J‖₁ }
Definition: A q × p matrix A satisfies the RE(s₀, k₀, A) condition with parameter K(s₀, k₀, A) if
1/K(s₀, k₀, A) := min_{J ⊆ {1,…,p}, |J| ≤ s₀}  min_{υ ≠ 0: ‖υ_{J^c}‖₁ ≤ k₀‖υ_J‖₁}  ‖Aυ‖₂ / ‖υ_J‖₂ > 0
References: van de Geer 07, Bickel-Ritov-Tsybakov 09, van de Geer-Bühlmann 09
An elementary estimate
Lemma: For each vector υ ∈ C(s₀, k₀), let T₀ denote the locations of the s₀ largest coefficients of υ in absolute value. Then
‖υ_{T₀^c}‖₂ ≤ ‖υ‖₁ / √|T₀|,  and  ‖υ‖₂ ≤ √(1 + k₀) ‖υ_{T₀}‖₂
Implication: Let A be a q × p matrix such that the RE(s₀, k₀, A) condition holds with 0 < K(s₀, k₀, A) < ∞. Then for every υ ∈ C(s₀, k₀) ∩ S^{p−1},
‖Aυ‖₂ ≥ ‖υ_{T₀}‖₂ / K(s₀, k₀, A) ≥ 1 / (K(s₀, k₀, A) √(1 + k₀)) > 0
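The two estimates in the lemma are easy to sanity-check numerically: the sketch below (illustrative code, not from the slides) draws random vectors, rescales them into the cone C(s₀, k₀), and verifies both inequalities.

```python
import numpy as np

rng = np.random.default_rng(2)
p, s0, k0 = 60, 5, 3

for _ in range(100):
    # Build a random vector in the cone C(s0, k0): pick a support J,
    # then rescale the off-support part so ||v_{J^c}||_1 <= k0 ||v_J||_1.
    v = rng.standard_normal(p)
    J = rng.choice(p, size=s0, replace=False)
    mask = np.zeros(p, dtype=bool)
    mask[J] = True
    scale = k0 * np.abs(v[mask]).sum() / np.abs(v[~mask]).sum()
    v[~mask] *= min(1.0, scale)

    # T0 = locations of the s0 largest |v_i|
    T0 = np.argsort(-np.abs(v))[:s0]
    vT0 = np.zeros(p)
    vT0[T0] = v[T0]
    vT0c = v - vT0

    # Both estimates of the lemma hold
    assert np.linalg.norm(vT0c) <= np.linalg.norm(v, 1) / np.sqrt(s0) + 1e-12
    assert np.linalg.norm(v) <= np.sqrt(1 + k0) * np.linalg.norm(vT0) + 1e-12

print("both inequalities hold on all samples")
```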
Sparse eigenvalues
Definition: For m ≤ p, we define the largest and smallest m-sparse eigenvalues of a q × p matrix A to be
ρ_max(m, A) := max_{t ≠ 0, m-sparse} ‖At‖₂² / ‖t‖₂²,
ρ_min(m, A) := min_{t ≠ 0, m-sparse} ‖At‖₂² / ‖t‖₂²
If RE(s₀, k₀, A) is satisfied with k₀ ≥ 1, then the square submatrices of size 2s₀ of A^T A are necessarily positive definite, that is, ρ_min(2s₀, A) > 0
Example: a matrix A which satisfies the Restricted Eigenvalue condition, but not RIP (Raskutti, Wainwright, and Yu 10)
Spiked Identity matrix: for a ∈ [0, 1),
Σ_{p×p} = (1 − a) I_{p×p} + a · 11^T,
where 1 ∈ R^p is the vector of all ones; ρ_min(Σ) = 1 − a > 0
Then for every s₀ × s₀ submatrix Σ_SS, we have
ρ_max(Σ_SS) / ρ_min(Σ_SS) = (1 + a(s₀ − 1)) / (1 − a)
The largest sparse eigenvalue diverges as s₀ → ∞, but ‖Σ^{1/2} e_j‖₂ = 1 stays bounded
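The spiked-identity computation can be checked directly: every s₀ × s₀ principal submatrix of Σ has eigenvalues 1 − a (with multiplicity s₀ − 1) and 1 + a(s₀ − 1), so the sparse condition number grows linearly in s₀. An illustrative sketch:

```python
import numpy as np

p, a = 200, 0.8
# Spiked identity: (1 - a) I + a * ones * ones^T
Sigma = (1 - a) * np.eye(p) + a * np.ones((p, p))

for s0 in (5, 20, 80):
    S = Sigma[:s0, :s0]          # any s0 x s0 principal submatrix (all identical)
    eig = np.linalg.eigvalsh(S)  # ascending eigenvalues
    # eigenvalues: 1 - a (multiplicity s0 - 1) and 1 + a(s0 - 1)
    ratio = eig[-1] / eig[0]
    print(s0, ratio)             # ratio = (1 + a(s0 - 1)) / (1 - a), grows with s0

# diagonal entries of Sigma are 1, so ||Sigma^{1/2} e_j||_2 = 1 for every j
print(Sigma[0, 0])
```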
Motivation: to construct classes of design matrices for which the Restricted Eigenvalue condition is satisfied
Design matrix X with independent rows, rather than independent entries: e.g., consider, for some matrix A (q × p),
X = ΨA,
where the rows of the matrix Ψ (n × q) are independent isotropic vectors with subgaussian marginals, and RE(s₀, (1+ε)k₀, A) holds for some ε > 0, p > s₀ > 0, and k₀ > 0
Design matrix X consisting of independent identically distributed rows with bounded entries, whose covariance matrix Σ(X_i) = E X_i X_i^T satisfies RE(s₀, (1+ε)k₀, Σ^{1/2})
The rows of X will be sampled from some distribution on R^p; the distribution may be highly non-Gaussian and perhaps discrete
Outline Introduction The main results The reduction principle Applications of the reduction principle Ingredients of the proof Conclusion
Notation
Let e₁, …, e_p be the canonical basis of R^p
For a set J ⊆ {1, …, p}, denote E_J = span{e_j : j ∈ J}
For a matrix A, we use ‖A‖ to denote its operator norm
For a set V ⊆ R^p, we let conv V denote the convex hull of V
For a finite set Y, the cardinality is denoted by |Y|
Let B₂^p and S^{p−1} be the unit Euclidean ball and the unit sphere, respectively
The reduction principle
Theorem: Let E = ⋃_{|J| = d} E_J for d = d(3k₀) < p, where
d(3k₀) = s₀ + s₀ max_j ‖Ae_j‖₂² · 16 K²(s₀, 3k₀, A) (3k₀)² (3k₀ + 1) / δ²,
and E denotes R^p otherwise. Let Ψ be a matrix such that
∀x ∈ AE,  (1 − δ)‖x‖₂ ≤ ‖Ψx‖₂ ≤ (1 + δ)‖x‖₂
Then RE(s₀, k₀, ΨA) holds with 0 < K(s₀, k₀, ΨA) ≤ K(s₀, k₀, A) / (1 − 5δ)
If the matrix Ψ acts as an almost isometry on the images of the d-sparse vectors under A, then the product ΨA satisfies the RE condition with a smaller parameter k₀
Reformulation of the reduction principle
Theorem (restricted isometry): Let Ψ be a matrix such that
∀x ∈ AE,  (1 − δ)‖x‖₂ ≤ ‖Ψx‖₂ ≤ (1 + δ)‖x‖₂   (*)
Then for any x ∈ A(C(s₀, k₀)) ∩ S^{q−1},
(1 − 5δ) ≤ ‖Ψx‖₂ ≤ (1 + 3δ)
If the matrix Ψ acts as an almost isometry on the images of the d-sparse vectors under A, then it acts the same way on the images of C(s₀, k₀)
This reduces the problem to checking that the almost isometry property holds for all vectors from certain low-dimensional subspaces, which is easier than checking the RE property directly
Definition: subgaussian random vectors
Let Y be a random vector in R^p
1. Y is called isotropic if for every y ∈ R^p, E⟨Y, y⟩² = ‖y‖₂²
2. Y is ψ₂ with a constant α if for every y ∈ R^p,
‖⟨Y, y⟩‖_{ψ₂} := inf{t : E exp(⟨Y, y⟩²/t²) ≤ 2} ≤ α‖y‖₂
The ψ₂ condition on a scalar random variable V is equivalent to the subgaussian tail decay of V, which means for some constant c,
P(|V| > t) ≤ 2 exp(−t²/c²), for all t > 0
A random vector Y in R^p is subgaussian if the one-dimensional marginals ⟨Y, y⟩ are subgaussian random variables for all y ∈ R^p
The first application of the reduction principle
Let A be a q × p matrix satisfying the RE(s₀, 3k₀, A) condition. Let m = min(d, p), where
d = s₀ + s₀ max_j ‖Ae_j‖₂² · 16 K²(s₀, 3k₀, A) (3k₀)² (3k₀ + 1) / δ²
Theorem: Let Ψ be an n × q matrix whose rows are independent isotropic ψ₂ random vectors in R^q with constant α. Suppose
n ≥ (2000 m α⁴ / δ²) log(60ep / (mδ))
Then with probability at least 1 − 2 exp(−δ²n / (2000α⁴)), the RE(s₀, k₀, (1/√n)ΨA) condition holds with
0 < K(s₀, k₀, (1/√n)ΨA) ≤ K(s₀, k₀, A) / (1 − δ)
Examples of subgaussian vectors
The random vector Y with iid N(0, 1) random coordinates
A discrete Gaussian vector: a random vector taking values on the integer lattice Z^p with distribution P(X = m) = C exp(−‖m‖₂²/2) for m ∈ Z^p
A vector with independent centered bounded random coordinates; in particular, vectors with random symmetric Bernoulli coordinates, in other words, random vertices of the discrete cube
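A quick empirical check of isotropy (E⟨Y, y⟩² = ‖y‖₂²) for two of the ensembles above, iid Gaussian coordinates and random vertices of the discrete cube (illustrative code, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
p, N = 8, 200_000
y = rng.standard_normal(p)          # a fixed test direction

# Two isotropic subgaussian ensembles from the slide:
ensembles = {
    "gaussian": rng.standard_normal((N, p)),           # iid N(0,1) coordinates
    "rademacher": rng.choice([-1.0, 1.0], size=(N, p)),  # vertices of the cube
}

results = {}
for name, Y in ensembles.items():
    # isotropy: the empirical E <Y, y>^2 should be close to ||y||_2^2
    results[name] = np.mean((Y @ y) ** 2)
    print(name, results[name], y @ y)
```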
Previous results on (sub)gaussian random vectors
Raskutti, Wainwright, and Yu 10: RE(s₀, k₀, X) holds for a random Gaussian measurement / design matrix X which consists of n = O(s₀ log p) independent copies of a Gaussian random vector Y ∼ N_p(0, Σ), assuming that the RE condition holds for Σ^{1/2}
Their proof relies on a deep result from the theory of Gaussian random processes: Gordon's Minimax Lemma
To establish the RE condition for more general classes of random matrices, we had to introduce a new approach based on geometric functional analysis, namely, the reduction principle
The bound n = O(s₀ log p) can be improved to the optimal one, n = O(s₀ log(p/s₀)), when RE(s₀, k₀, Σ^{1/2}) is replaced with RE(s₀, (1+ε)k₀, Σ^{1/2}) for any ε > 0
In Zhou 09, subgaussian random matrices of the form X = ΨΣ^{1/2} were considered, where Σ is a p × p positive semidefinite matrix: X satisfies the RE(s₀, k₀) condition with overwhelming probability if, for K := K(s₀, k₀, Σ^{1/2}),
n > (9c²α⁴/δ²) max{ (2 + k₀)² K⁴ ρ_max(s₀, Σ^{1/2}) s₀ log(5ep/s₀), log p }
The analysis there used a result in Mendelson et al 07, 08
The current result involves neither ρ_max(s₀, A) nor the global parameters of the matrices A and Ψ, such as the norm or the smallest singular value
Recall the Spiked Identity matrix: for a ∈ [0, 1), Σ_{p×p} = (1 − a)I_{p×p} + a · 11^T, which satisfies the RE condition, yet ρ_max(s₀, Σ^{1/2}) grows linearly in s₀ while max_j ‖Σ^{1/2} e_j‖₂ = 1
Design matrices with uniformly bounded entries
Let Y ∈ R^p be a random vector such that ‖Y‖_∞ ≤ M a.s., and denote Σ = E YY^T. Let X be an n × p matrix whose rows X₁, …, X_n are independent copies of Y. Set
d = s₀ + s₀ max_j ‖Σ^{1/2} e_j‖₂² · 16 K²(s₀, 3k₀, Σ^{1/2}) (3k₀)² (3k₀ + 1) / δ²
Theorem: Assume that d ≤ p and ρ = ρ_min(d, Σ^{1/2}) > 0. Let Σ satisfy the RE(s₀, 3k₀, Σ^{1/2}) condition. Suppose
n ≥ (C M² d log p / (ρδ²)) log³(C M² d log p / (ρδ²))
Then with probability at least 1 − exp(−δρn/(6M²d)), RE(s₀, k₀, X/√n) holds for the matrix X/√n with
0 < K(s₀, k₀, X/√n) ≤ K(s₀, k₀, Σ^{1/2}) / (1 − δ)
Remarks on applying the reduction principle
To analyze different classes of random design matrices:
Unlike the case of a random matrix with subgaussian marginals, the estimate in the second example involves the minimal sparse singular value ρ_min(d, Σ^{1/2})
The reconstruction of sparse signals by subgaussian design matrices or by the random Fourier ensemble was analyzed in the literature before, but only under RIP assumptions
The reduction principle can be applied to other types of random variables: e.g., random vectors with heavy-tailed marginals, or random vectors with log-concave densities
References: Rudelson-Vershynin 05, Baraniuk et al 08, Mendelson et al 08, Vershynin 11a, b, Adamczak et al 09
Maurey's empirical approximation argument (Pisier 81)
Let u₁, …, u_M ∈ R^q, and let y ∈ conv(u₁, …, u_M):
y = Σ_{j ∈ {1,…,M}} α_j u_j, where α_j ≥ 0 and Σ_j α_j = 1
There exists a set L ⊆ {1, …, M} such that
|L| ≤ m = 4 max_{j ∈ {1,…,M}} ‖u_j‖₂² / ε²
and a vector y′ ∈ conv(u_j, j ∈ L) such that ‖y − y′‖₂ ≤ ε
Proof: an application of the probabilistic method: if we only want to approximate y rather than represent it exactly as a convex combination of u₁, …, u_M, this is possible with many fewer points, namely, (u_l)_{l ∈ L}
Let y = Σ_{j ∈ {1,…,M}} α_j u_j, where α_j ≥ 0 and Σ_j α_j = 1
Goal: to find a vector y′ ∈ conv(u_j, j ∈ L) such that ‖y − y′‖₂ ≤ ε
Let Y be a random vector in R^q such that P(Y = u_l) = α_l, l ∈ {1, …, M}. Then
E(Y) = Σ_{l ∈ {1,…,M}} α_l u_l = y
Let Y₁, …, Y_m be independent copies of Y, and let ε₁, …, ε_m be ±1 iid mean-zero Bernoulli random variables, chosen independently of Y₁, …, Y_m. By the standard symmetrization argument we have
E‖y − (1/m) Σ_{j=1}^m Y_j‖₂² ≤ 4 E‖(1/m) Σ_{j=1}^m ε_j Y_j‖₂² = (4/m²) Σ_{j=1}^m E‖Y_j‖₂² ≤ 4 max_{l ∈ {1,…,M}} ‖u_l‖₂² / m = ε²   (1)
where sup_j ‖Y_j‖₂ ≤ max_{l ∈ {1,…,M}} ‖u_l‖₂, and the last equality in (1) follows from the definition of m
Fix a realization Y_j = u_{k_j}, j = 1, …, m, for which
‖y − (1/m) Σ_{j=1}^m Y_j‖₂ ≤ ε
(Figure: the points u_{k₁}, …, u_{k_m} and their average approximating y.)
The vector (1/m) Σ_{j=1}^m Y_j belongs to the convex hull of {u_l : l ∈ L}, where L is the set of distinct elements of the sequence k₁, …, k_m. Obviously |L| ≤ m, and the lemma is proved. QED
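Maurey's argument is constructive and easy to simulate: sample m indices with probabilities α_l and average the corresponding points; by (1) the expected squared distance to y is at most 4 max_l ‖u_l‖₂²/m = ε², so a good realization is found quickly by repeated sampling. An illustrative sketch (all names and sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
q, M = 50, 1000

# Points u_1, ..., u_M and a target y in their convex hull
U = rng.standard_normal((M, q))
alpha = rng.dirichlet(np.ones(M))   # convex weights, alpha_j >= 0, sum = 1
y = alpha @ U

eps = 0.5
m = int(np.ceil(4 * np.max(np.linalg.norm(U, axis=1)) ** 2 / eps ** 2))

# Sample m points with P(Y = u_l) = alpha_l and average them;
# E ||y - mean||_2^2 <= 4 max_l ||u_l||_2^2 / m = eps^2,
# so a realization within eps of y exists.
best = np.inf
for _ in range(20):
    idx = rng.choice(M, size=m, p=alpha)
    y_prime = U[idx].mean(axis=0)   # lies in conv(u_l : l in the sample)
    best = min(best, np.linalg.norm(y - y_prime))
print(best, eps)                    # best <= eps with high probability
```

Note that m depends only on ε and max_l ‖u_l‖₂, not on M: the approximation uses far fewer points than the exact convex combination.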
The Inclusion Lemma
To prove the restricted isometry of Ψ over the set of vectors in A(C(s₀, k₀)) ∩ S^{q−1}, we show that this set is contained in the convex hull of the images of the sparse vectors with norms not exceeding (1 − δ)^{−1}
Lemma: Let 0 < δ < 1. Suppose the RE(s₀, k₀, A) condition holds for the q × p matrix A. For a set J ⊆ {1, …, p}, E_J = span{e_j : j ∈ J}. Set
d = d(k₀, A) = s₀ + s₀ max_j ‖Ae_j‖₂² · 16 K²(s₀, k₀, A) k₀² (k₀ + 1) / δ²
Then
A(C(s₀, k₀)) ∩ S^{q−1} ⊆ (1 − δ)^{−1} conv( ⋃_{|J| ≤ d} AE_J ∩ S^{q−1} ),
where for d ≥ p, E_J is understood to be R^p
Conclusion
We prove a general reduction principle showing that if the matrix Ψ acts as an almost isometry on the images of the sparse vectors under A, then the product ΨA satisfies the RE condition with a smaller parameter k₀
We apply the reduction principle to analyze different classes of random design matrices
This analysis is reduced to checking that the almost isometry property holds for all vectors from certain low-dimensional subspaces, which is easier than checking the RE property directly
Thank you!