arxiv: v1 [stat.ml] 3 Nov 2010


Preprint

The Lasso under Heteroscedasticity

Jinzhu Jia, Karl Rohe and Bin Yu
Department of Statistics and Department of EECS, University of California, Berkeley

arXiv:1011.1026v1 [stat.ML] 3 Nov 2010

Abstract: The performance of the Lasso is well understood under the assumptions of the standard linear model with homoscedastic noise. However, in several applications, the standard model does not describe the important features of the data. This paper examines how the Lasso performs on a non-standard model that is motivated by medical imaging applications. In these applications, the variance of the noise scales linearly with the expectation of the observation. Like all heteroscedastic models, the noise terms in this Poisson-like model are not independent of the design matrix. More specifically, this paper studies the sign consistency of the Lasso under a sparse Poisson-like model. In addition to studying sufficient conditions for the sign consistency of the Lasso estimate, this paper also gives necessary conditions for sign consistency. Both sets of conditions are comparable to results for the homoscedastic model, showing that when a measure of the signal to noise ratio is large, the Lasso performs well on both Poisson-like data and homoscedastic data. Simulations reveal that the Lasso performs equally well in terms of model selection performance on both Poisson-like data and homoscedastic data (with properly scaled noise variance), across a range of parameterizations. Taken as a whole, these results suggest that the Lasso is robust to the Poisson-like heteroscedastic noise.

Key words and phrases: Lasso, Poisson-like Model, Sign Consistency, Heteroscedasticity

1 Introduction

The Lasso (Tibshirani, 1996) is widely used in high dimensional regression for variable selection. Its model selection performance has been well studied under a standard sparse and homoscedastic regression model.
Several researchers have shown that under sparsity and regularity conditions, the Lasso can select the true model asymptotically even when $p \gg n$ (Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Tropp, 2006; Wainwright, 2009; Zhao and Yu, 2006).

To define the Lasso estimate, suppose the observed data are independent pairs $\{(x_i, Y_i)\} \subset \mathbb{R}^p \times \mathbb{R}$ for $i = 1, 2, \ldots, n$ following the linear regression model

$$Y_i = x_i^T \beta^* + \epsilon_i, \tag{1}$$

where $x_i^T$ is a row vector representing the predictors for the $i$th observation, $Y_i$ is the corresponding $i$th response variable, the $\epsilon_i$'s are independent and mean zero noise terms, and $\beta^* \in \mathbb{R}^p$. Use $X \in \mathbb{R}^{n \times p}$ to denote the $n \times p$ design matrix with $x_k^T = (X_{k1}, \ldots, X_{kp})$ as its $k$th row and with $X_j = (X_{1j}, \ldots, X_{nj})^T$ as its $j$th column, so that

$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} = (X_1, X_2, \ldots, X_p).$$

Let $Y = (Y_1, \ldots, Y_n)^T$ and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T \in \mathbb{R}^n$. The Lasso estimate (Tibshirani, 1996) is then defined as the solution to a penalized least squares problem (with regularization parameter $\lambda$):

$$\hat\beta(\lambda) = \arg\min_{\beta} \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1, \tag{2}$$

where for some vector $x \in \mathbb{R}^k$, $\|x\|_r = \left(\sum_{i=1}^k |x_i|^r\right)^{1/r}$.

In previous research on the Lasso, the above model has been assumed with noise terms that are i.i.d. and independent of the predictors (hence homoscedastic). We call this the standard model. Candes and Tao (2007) suggested that compressed sensing, a sparse method similar to the Lasso, could reduce the number of measurements needed by medical imaging technology like Magnetic Resonance Imaging (MRI). This methodology was later applied to MRI by Lustig et al. (2008). The standard model was useful to their analyses. However, the standard model is not appropriate for other imaging methods such as PET and SPECT (Fessler, 2000).

PET provides an indirect measure of the metabolic activity of a specific tissue. To take an image, a biochemical metabolite must be identified that is attractive to the tissue under investigation. This biochemical metabolite is labeled with a positron emitting radioactive material and it is then injected into the subject. This substance circulates through the subject, emitting positrons. When the tissue gathers the metabolite, the radioactive material concentrates around the tissue. The positron emissions can be modeled by a Poisson point process in three dimensions with an intensity rate proportional to the varying concentrations of the biochemical metabolite. Therefore, an estimate of the intensity rate is an estimate of the level of biochemical metabolite. However, the positron emissions are not directly observed. After each positron is emitted it very quickly annihilates a nearby electron, sending two X-ray photons in nearly opposite directions (at the speed of light) (Vardi et al., 1985). These X-rays are observed by several sensors in a ring surrounding the subject.

A physical model of this system informs the estimation of the intensity level of the Poisson process from the observed data. It can be expressed as a Poisson model where the sample size $n$ represents the number of sensors; $Y$ is a vector of observed values; $\beta^*_j$ represents the Poisson intensity rate for a small cubic volume (a voxel) inside the subject; the design matrix $X$ specifies the physics of the tomography and emissions process; and $p$ is the number of voxels wanted: the more voxels, the finer the resolution of the final image. Because the positron emissions are modeled by a Poisson point process, the variance of each observed value $Y_i$ is equal to the expected value $E(Y_i)$.

Motivated by the Poissonian model, this paper studies the Lasso under the following sparse Poisson-like model:

$$Y = X\beta^* + \epsilon, \quad E(\epsilon \mid X) = 0, \quad \mathrm{Cov}(\epsilon \mid X) = \sigma^2 \times \mathrm{diag}(|X\beta^*|), \quad \epsilon \perp X(S^c) \mid X(S), \tag{3}$$

where $\sigma^2 > 0$ and the sparsity index set is defined as $S = \{1 \le j \le p : \beta^*_j \neq 0\}$, with the cardinality $q = \#S$ such that $0 < q < p$.
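To make model (3) concrete, the following is a minimal sketch of how one might draw data from it, assuming conditionally Gaussian noise. The function name `simulate_poisson_like`, the toy dimensions, and the coefficient vector are illustrative assumptions, not part of the paper.

```python
import random

def simulate_poisson_like(X, beta, sigma2, rng):
    """Draw Y = X beta + eps with Var(eps_i | X) = sigma2 * |x_i^T beta|.

    Hypothetical helper: the noise is taken to be Gaussian given X.
    """
    Y, variances = [], []
    for row in X:
        mean_i = sum(x * b for x, b in zip(row, beta))   # x_i^T beta
        var_i = sigma2 * abs(mean_i)                     # noise variance scales with |mean|
        Y.append(mean_i + rng.gauss(0.0, var_i ** 0.5))
        variances.append(var_i)
    return Y, variances

rng = random.Random(0)
n, p = 6, 4                                  # toy sizes, chosen only for illustration
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta = [3.0, -2.0, 0.0, 0.0]                 # assumed sparse vector with S = {1, 2}, q = 2
Y, variances = simulate_poisson_like(X, beta, sigma2=1.0, rng=rng)
```

Because the zero coefficients drop out of each inner product, the noise variance depends on the design only through the relevant columns $X(S)$, which matches the conditional independence statement in (3).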
In the definition of the Poisson-like model, $\epsilon$ conditioned on $X$ consists of independent Gaussian variables; $\mathrm{Cov}(\epsilon \mid X)$, the variance-covariance matrix of $\epsilon$ conditioned on $X$, is $\sigma^2 \times \mathrm{diag}(|X\beta^*|)$, an $n \times n$ diagonal matrix with the vector $\sigma^2 |X\beta^*|$ down the diagonal; and $X(S)$ and $X(S^c)$ denote two matrices consisting of the relevant column vectors (with nonzero coefficients) and irrelevant column vectors (with zero coefficients) respectively. This is a heteroscedastic model.

Since the Lasso provides a computationally feasible way to select a model (Efron et al., 2004; Osborne et al., 2000; Rosset, 2004; Zhao and Yu, 2007), it can be applied in non-standard settings to give sparse solutions. In this paper we show that the Lasso is robust to the heteroscedastic noise of the sparse Poisson-like model. Under the Poisson-like model, for general scalings of $p$, $q$, $n$, and $\beta^*$, this paper uses theoretical and simulation studies to investigate when the Lasso is sign consistent and when it is not. Our results are comparable to the results for the standard model: when a measure of the signal to noise ratio is large, the Lasso is sign consistent. This is the first study of the sign consistency of the standard Lasso in a heteroscedastic setting.

1.1 Overview of Previous Work

The Lasso (Tibshirani, 1996) has been a popular technique to simultaneously select a model and provide regularized estimated coefficients. There is a substantial literature on the use of the Lasso for sparsity recovery and subset selection under the standard homoscedastic linear model. This subsection gives only a very brief overview.

In the noiseless setting (when $\epsilon = 0$), with contributions from a broad range of researchers (Chen et al., 1998; Donoho and Huo, 2001; Candes and Tao, 2004; Elad and Bruckstein, 2002, 2003; Tropp, 2004), there is now much understanding of sufficient conditions on deterministic predictors $\{X_i, i = 1, \ldots, n\}$ and sparsity index $S = \{j : \beta^*_j \neq 0\}$ for which the true $\beta^*$ can be recovered exactly. Results by Donoho (2004), as well as Candes and Tao (2005), provide high probability results for random ensembles of $X$.
There is also a substantial body of work focusing on the noisy setting (where $\epsilon$ is random noise). Knight and Fu (2000) analyze the asymptotic behavior of the optimal solution for fixed dimension ($p$), not only for $\ell_1$ regularization, but for $\ell_r$ regularization with $r \in (0, 2]$. Both Tropp (2006) and Donoho et al. (2006) provide sufficient conditions for the support of the optimal solution to the Lasso problem (2) to be contained within the support of $\beta^*$. Recent work on the use of the Lasso for model selection by Meinshausen and Bühlmann (2006) focuses on Gaussian graphical models. Zhao and Yu (2006) consider linear regression and more general noise distributions. For the case of Gaussian noise and Gaussian predictors, both papers established that under particular mutual incoherence conditions and the appropriate choice of the regularization parameter $\lambda$, the Lasso can recover the sparsity pattern with probability converging to one for particular regimes of $n$, $p$ and $q$. Zhao and Yu (2006) used a particular mutual incoherence condition, the Irrepresentable Condition, which they show is almost necessary when $p$ is fixed. The Irrepresentable Condition was found in Fuchs (2005) and Zou (2006) as well. For i.i.d. Gaussian or sub-Gaussian noise, Wainwright (2009) established a sharp relation between the problem dimension $p$, the number $q$ of nonzero elements in $\beta^*$, and the number of observations $n$ that are required for sign consistency.

1.2 The Contributions in this Paper

Before giving the contributions of this paper, some definitions are needed. Define

$$\mathrm{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0. \end{cases}$$

Define $=_s$ such that $\hat\beta(\lambda) =_s \beta^*$ if and only if $\mathrm{sign}(\hat\beta(\lambda)) = \mathrm{sign}(\beta^*)$ elementwise.

Definition 1. The Lasso is sign consistent if there exists a sequence $\lambda_n$ such that

$$P\left(\hat\beta(\lambda_n) =_s \beta^*\right) \to 1, \quad \text{as } n \to \infty.$$

This paper studies the sign consistency of the Lasso applied to data from the sparse Poisson-like model, giving non-asymptotic results for both the deterministic design and the Gaussian random design. The non-asymptotic results give the probability that $\hat\beta(\lambda) =_s \beta^*$, for any $\lambda$, $p$, $q$, and $n$. The sign consistency results follow from the non-asymptotic results. This paper also gives necessary conditions for the Lasso to be sign consistent under the sparse Poisson-like model.
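The sign-equality relation used in the definition of sign consistency can be sketched in a few lines. The helper names `sign`, `sign_equal`, and `success_rate` are hypothetical and only illustrate the definitions; the last one is the Monte Carlo proportion one would use to estimate the probability of sign recovery from repeated fits.

```python
def sign(x):
    """Return 1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_equal(beta_hat, beta_star):
    """True when sign(beta_hat) == sign(beta_star) elementwise."""
    return all(sign(a) == sign(b) for a, b in zip(beta_hat, beta_star))

def success_rate(estimates, beta_star):
    """Fraction of estimates that recover the signs of beta_star."""
    return sum(sign_equal(b, beta_star) for b in estimates) / len(estimates)

beta_star = [2.0, -1.5, 0.0]               # assumed toy coefficient vector
estimates = [
    [0.7, -0.1, 0.0],   # signs match, even though magnitudes are shrunk
    [0.7, -0.1, 0.3],   # a false positive in the third coordinate
    [0.7, 0.0, 0.0],    # a missed nonzero coefficient
]
rate = success_rate(estimates, beta_star)
```

Sign equality is stricter than support recovery: the estimate must be exactly zero off the support and carry the correct sign on it, but the magnitudes are free.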
It is shown that the Irrepresentable Condition is necessary for the Lasso's sign consistency under this model. This condition is also necessary under the standard model (Wainwright, 2009; Zhao and Yu, 2006; Zou, 2006).

The sufficient conditions for both the deterministic design and random Gaussian design require that the variance of the noise is not too large and that the smallest nonzero element of $\beta^*$ is not too small. Define the smallest nonzero element of $\beta^*$ (in absolute value) as

$$M(\beta^*) = \min_{j \in S} |\beta^*_j|.$$

For deterministic design, assume that

$$\Lambda_{\min}\left(\frac{1}{n} X(S)^T X(S)\right) \ge C_{\min} > 0,$$

where $\Lambda_{\min}(\cdot)$ denotes the minimal eigenvalue of a matrix and $C_{\min}$ is some positive constant; for random Gaussian design, assume that

$$\Lambda_{\min}(\Sigma_{11}) \ge C_{\min} > 0 \quad \text{and} \quad \Lambda_{\max}(\Sigma) \le C_{\max} < \infty,$$

where $\Sigma_{11} \in \mathbb{R}^{q \times q}$ is the variance-covariance matrix of the true predictors, $\Sigma \in \mathbb{R}^{p \times p}$ is the variance-covariance matrix of all predictors, $\Lambda_{\max}(\cdot)$ denotes the maximal eigenvalue of a matrix, and $C_{\min}$ and $C_{\max}$ are some positive constants.

An essential quantity for determining the probability of sign recovery is the signal to noise ratio

$$\mathrm{SNR} = \frac{n\,[M(\beta^*)]^2}{\sigma^2 \|\beta^*\|_2}. \tag{4}$$

The numerator corresponds to the signal strength for sign recovery. The most difficult sign to estimate in $\beta^*$ is the element that corresponds to $M(\beta^*)$. When the smallest element is larger, estimating the signs is easier, and the signal is more powerful. The noise term in the denominator contains $\|\beta^*\|_2$ because, in the heteroscedastic model considered in this paper, $\beta^*$ is fundamental in the scaling of the noise. Because this paper is addressing the sign consistency of the Lasso, the definition of the signal in SNR does not correspond to the typical definition.

When SNR is large, the Lasso is sign consistent. Specifically, the sufficient condition for deterministic design requires that

$$\mathrm{SNR} = \Omega\left(q \log(p+1) \max_i \|x_i(S)\|_2\right),$$

where $a_n = \Omega(b_n)$ means that $a_n$ grows faster than $b_n$, that is, $a_n / b_n \to \infty$. The sufficient conditions for random Gaussian design require that

$$\mathrm{SNR} > \frac{8}{C_{\min}} \log n \sqrt{\max(4q, \log n)} \quad \text{and} \quad \mathrm{SNR} = \Omega\left(q \log(p - q + 1)\right).$$

The previous three conditions all require that the signal to noise ratio is large. This essential result for the Poisson-like model is identical to the corresponding result for the standard model: both require that the variance of the noise is small compared to the size of the signal. The simulations in Section 4 support this essential result.

Several of the mathematical techniques in this paper come from Wainwright (2009). However, our proofs are more involved because the errors in the Poisson-like model are heteroscedastic and dependent on the design matrix. The results in this paper require bounding $\epsilon^T \epsilon / n$ and $X(S)^T \epsilon$. When the errors are Gaussian and homoscedastic, the first quantity is distributed as $\chi^2$. When the errors are heteroscedastic, the distribution becomes more complicated. This paper calculates the second moment of $\epsilon_i^2$ and bounds $\epsilon^T \epsilon / n$ via the Chebyshev inequality. Similarly for the second quantity, when the errors depend on the design matrix, the variance of $X(S)^T \epsilon$ is more difficult to bound. In this paper, the variance of $X(S)^T \epsilon$ is bounded using

$$P\left[\max_{i = 1, \ldots, n} \|x_i(S)\|_2^2 \ge 8\, C_{\max} \max(4q, \log n)\right] \le \frac{1}{n},$$

where $x_i(S)$ is the $i$th row of $X(S)$. This large deviation result regarding the $\chi^2$ distribution is given in Appendix 3.

The remainder of the paper is organized as follows. Section 2 analyzes the Lasso estimator when the design matrix is deterministic. Section 3 considers the case where the rows of $X$ are i.i.d. Gaussian vectors. Both sections give (1) sufficient conditions for the Lasso to be sign consistent and (2) necessary conditions for the Lasso's sign consistency. In Section 4, simulations demonstrate the fundamental role of SNR and show that the Lasso performs similarly on both homoscedastic and Poisson-like data. Section 5 gives some concluding thoughts. Proofs are presented in the Appendix.

2 Deterministic Design

This section examines when the Lasso is sign consistent and when it is not sign consistent under the sparse Poisson-like model for a nonrandom design matrix $X$. First, some notation: $x_i(S) = e_i^T X(S)$, where $e_i$ is the unit vector with $i$th element one and the rest zero. Because $S = \{j : \beta^*_j \neq 0\}$ is the sparsity index set, $x_i(S)$ is a row vector of dimension $q$. Define $\beta^*(S) = (\beta^*_j)_{j \in S}$ and $b = \mathrm{sign}(\beta^*(S))$.

Suppose the Irrepresentable Condition holds. That is, for some constant $\eta \in (0, 1]$,

$$\left\| X(S^c)^T X(S) \left(X(S)^T X(S)\right)^{-1} b \right\|_\infty \le 1 - \eta. \tag{5}$$

The $\ell_\infty$ norm of a vector, $\|\cdot\|_\infty$, is defined as the vector's largest element in absolute value. In addition, assume that

$$\Lambda_{\min}\left(\frac{1}{n} X(S)^T X(S)\right) \ge C_{\min} > 0, \tag{6}$$

where $\Lambda_{\min}$ denotes the minimal eigenvalue and $C_{\min}$ is some positive constant. Condition (6) guarantees that the matrix $X(S)^T X(S)$ is invertible. These conditions are also needed in Wainwright (2009) for sign consistency of the Lasso under the standard model.

Define

$$\Psi(X, \beta^*, \lambda) = \lambda \left[ \eta\, (C_{\min})^{-1/2} + \left\| \left(\frac{1}{n} X(S)^T X(S)\right)^{-1} b \right\|_\infty \right],$$

with which:

Theorem 1. Suppose that data $(X, Y)$ follow the sparse Poisson-like model described by Equation (3) and each column of $X$ is normalized to $\ell_2$-norm $\sqrt{n}$. Assume that (5) and (6) hold. If $\lambda$ satisfies $M(\beta^*) > \Psi(X, \beta^*, \lambda)$, then with probability greater than

$$1 - 2 \exp\left\{ -\frac{n \lambda^2 \eta^2}{2 \sigma^2 \|\beta^*\|_2 \max_{1 \le i \le n} \|x_i(S)\|_2} + \log(p) \right\},$$

the Lasso has a unique solution $\hat\beta(\lambda)$ with $\hat\beta(\lambda) =_s \beta^*$.

Theorem 1 can be thought of as a straightforward consequence of Theorem 1 in Wainwright (2009). In Wainwright (2009), sign consistency of the Lasso estimate is given for a standard model with sub-Gaussian noise with parameter $\sigma^2$. In the Poisson-like model, since

$$\mathrm{var}(\epsilon_i \mid x_i) = \sigma^2 |x_i^T \beta^*| \le \sigma^2 \max_i \|x_i(S)\|_2 \|\beta^*\|_2,$$

the noise can be thought of as sub-Gaussian variables with parameter $\sigma^2 \max_i \|x_i(S)\|_2 \|\beta^*\|_2$. To make this paper self-contained, a proof of Theorem 1 is given in the Appendix.

Theorem 1 gives a non-asymptotic result on the Lasso's sparsity pattern recovery property. The next corollary specifies a sequence of $\lambda$'s that can asymptotically recover the true sparsity pattern. The essential requirements are that

(1) $\dfrac{n \lambda^2}{\max_i \|x_i(S)\|_2 \, \|\beta^*\|_2 \, \log(p+1)} \to \infty$, and

(2) $M(\beta^*) > \Psi(X, \beta^*, \lambda)$.

Define

$$\Gamma(X, \beta^*, \sigma^2) = \frac{\eta^2\, \mathrm{SNR}}{8 \max_i \|x_i(S)\|_2 \left( \eta\, C_{\min}^{-1/2} + \sqrt{q}\, C_{\min}^{-1} \right)^2 \log(p+1)}.$$

Corollary 1. As in Theorem 1, suppose that data $(X, Y)$ follow the sparse Poisson-like model described by Equation (3) and each column of $X$ is normalized to $\ell_2$-norm $\sqrt{n}$. Assume that (5) and (6) hold. Take $\lambda$ such that

$$\lambda = \frac{M(\beta^*)}{2 \left( \eta\, C_{\min}^{-1/2} + \sqrt{q}\, C_{\min}^{-1} \right)}, \tag{7}$$

then $\hat\beta(\lambda) =_s \beta^*$ with probability greater than

$$1 - 2 \exp\left\{ -\left( \Gamma(X, \beta^*, \sigma^2) - 1 \right) \log(p+1) \right\}.$$

If $\Gamma(X, \beta^*, \sigma^2) \to \infty$, then $P\left[\hat\beta(\lambda) =_s \beta^*\right]$ converges to one.
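For readers tracing the constants, the following worked steps sketch how the choice of $\lambda$ in (7) turns the bound of Theorem 1 into the bound of Corollary 1. The derivation assumes the readings $\lambda = M(\beta^*) / \left[2(\eta C_{\min}^{-1/2} + \sqrt{q}\, C_{\min}^{-1})\right]$ for (7) and $\mathrm{SNR} = n [M(\beta^*)]^2 / (\sigma^2 \|\beta^*\|_2)$ for (4); it is an illustration, not a substitute for the proof in the Appendix.

$$
\begin{aligned}
&\text{Step 1 (the requirement } M(\beta^*) > \Psi\text{):}\quad
\left\| \left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1} b \right\|_\infty
\le \left\| \left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1} b \right\|_2
\le \frac{\|b\|_2}{C_{\min}} = \frac{\sqrt{q}}{C_{\min}}, \\
&\qquad \text{so}\quad \Psi(X, \beta^*, \lambda) \le \lambda \left( \eta\, C_{\min}^{-1/2} + \sqrt{q}\, C_{\min}^{-1} \right) = \tfrac{1}{2} M(\beta^*) < M(\beta^*). \\[6pt]
&\text{Step 2 (the exponent):}\quad
\frac{n \lambda^2 \eta^2}{2 \sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}
= \frac{\eta^2}{8 \max_i \|x_i(S)\|_2 \left( \eta\, C_{\min}^{-1/2} + \sqrt{q}\, C_{\min}^{-1} \right)^2} \cdot \frac{n [M(\beta^*)]^2}{\sigma^2 \|\beta^*\|_2}
= \Gamma(X, \beta^*, \sigma^2) \log(p+1). \\[6pt]
&\text{Step 3 (the probability):}\quad
1 - 2 \exp\left\{ -\Gamma \log(p+1) + \log(p) \right\}
\ge 1 - 2 \exp\left\{ -(\Gamma - 1) \log(p+1) \right\}.
\end{aligned}
$$

Step 1 uses condition (6) and $\|b\|_2 = \sqrt{q}$, since $b$ is a vector of $q$ signs; Step 3 uses only $\log(p) \le \log(p+1)$.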

A proof of this corollary can be found in the Appendix. This corollary gives a class of heteroscedastic models for which the Lasso gives a sign consistent estimate of $\beta^*$. This class requires that $\Gamma(X, \beta^*, \sigma^2) \to \infty$, which means that

$$\mathrm{SNR} = \frac{n\,[M(\beta^*)]^2}{\sigma^2 \|\beta^*\|_2} = \Omega\left( q \log(p+1) \max_i \|x_i(S)\|_2 \right), \tag{8}$$

where $a_n = \Omega(b_n)$ means that $a_n$ grows faster than $b_n$, that is, $a_n / b_n \to \infty$. In other words, this condition requires that SNR grows fast enough.

For a moment, suppose that the errors are homoscedastic and $\mathrm{var}(\epsilon_i) = \sigma^2$. The exact same theorem could be proven for this homoscedastic model by replacing $\sigma^2 \|\beta^*\|_2$ with $\sigma^2$ and removing $\max_i \|x_i(S)\|_2$. This shows that when a measure of the signal to noise ratio is large, the Lasso can select the true model in both the case of homoscedastic noise and Poisson-like noise.

The next corollary addresses the classical setting, where $p$, $q$, and $\beta^*$ are all fixed and $n$ goes to infinity. While this is a straightforward result from Corollary 1, it removes some of the complexities and leads to good intuition. Since $M(\beta^*)$ and $\|\beta^*\|_2$ do not change with $n$, $\Gamma(X, \beta^*, \sigma^2) \to \infty$ in Corollary 1 when $\frac{1}{n} \max_{1 \le i \le n} \|x_i(S)\|_2 \to 0$. Then:

Corollary 2. As in Theorem 1, suppose that data $(X, Y)$ follow the sparse Poisson-like model described by Equation (3) and each column of $X$ is normalized to $\ell_2$-norm $\sqrt{n}$. Assume that (5) and (6) hold. In the classical case when $p$, $q$ and $\beta^*$ are fixed, if

$$\frac{1}{n} \max_{1 \le i \le n} \|x_i(S)\|_2 \to 0, \tag{9}$$

then by choosing $\lambda$ as in equation (7), $P\left[\hat\beta(\lambda) =_s \beta^*\right] \to 1$ as $n \to \infty$.

Condition (9) is not strong and is easily satisfied. Suppose

$$0 < \Lambda_{\max}\left(\frac{1}{n} X(S)^T X(S)\right) \le C_{\max} < \infty,$$

where $\Lambda_{\max}(\cdot)$ is the maximum eigenvalue of a matrix and $C_{\max}$ is a positive constant. Then

$$\frac{\|x_i(S)\|_2^2}{n} = \frac{1}{n} \left\| e_i^T X(S) \right\|_2^2 \le \Lambda_{\max}\left(\frac{1}{n} X(S)^T X(S)\right) \le C_{\max}.$$

Consequently,

$$\frac{1}{n} \max_{1 \le i \le n} \|x_i(S)\|_2 = \frac{1}{\sqrt{n}} \max_{1 \le i \le n} \frac{\|x_i(S)\|_2}{\sqrt{n}} \le \frac{C_{\max}^{1/2}}{\sqrt{n}} \to 0.$$

Corollary 2 states that in the classical setting, the Lasso can consistently select the true model under the sparse Poisson-like model.

So far the results have given sufficient conditions for sign consistency of the Lasso. To further understand how the sign consistency of the Lasso might be sensitive to the heteroscedastic model, the next theorem gives necessary conditions on the ratio of $\beta^*_j$ to the noise level.

Theorem 2 (Necessary Conditions). Suppose that data $(X, Y)$ follow the sparse Poisson-like model described by Equation (3) and each column of $X$ is normalized to $\ell_2$-norm $\sqrt{n}$. Assume that (6) holds.

(a) Consider $\frac{1}{n} X(S)^T X(S) = I_{q \times q}$. For any $j \in S$, define

$$c_{n,j} = \frac{n\, |\beta^*_j|}{\sigma \sqrt{ e_j^T \left[ X(S)^T \mathrm{diag}(|X\beta^*|) X(S) \right] e_j }}. \tag{10}$$

Define $c_n = \min_j c_{n,j}$. Then, for sign consistency, it is necessary that $c_n \to \infty$. Specifically,

$$P\left[\hat\beta(\lambda) =_s \beta^*\right] \le 1 - \frac{\exp\{-c_n^2 / 2\}}{\sqrt{2\pi}\,(1 + c_n)}.$$

(b) If the Irrepresentable Condition (5) does not hold, specifically,

$$\left\| X(S^c)^T X(S) \left(X(S)^T X(S)\right)^{-1} b \right\|_\infty \ge 1, \tag{11}$$

then the Lasso estimate is not sign consistent: $P\left[\hat\beta(\lambda) =_s \beta^*\right] \le \frac{1}{2}$.

A proof of Theorem 2 can be found in the Appendix.
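The Irrepresentable Condition can be checked numerically for a given design. The sketch below computes $\| X(S^c)^T X(S) (X(S)^T X(S))^{-1} b \|_\infty$ with plain Python lists; the helper names `solve` and `irrepresentable_value` and the toy design are assumptions made for illustration. The fourth column is deliberately built as a scaled sum of the two relevant columns so that the condition fails for one sign pattern and holds for another.

```python
def solve(A, rhs):
    """Solve A u = rhs by Gauss-Jordan elimination with partial pivoting."""
    m = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * d for a, d in zip(M[r], M[c])]
    return [M[i][m] / M[i][i] for i in range(m)]

def irrepresentable_value(X, S, b):
    """Return || X(S^c)^T X(S) (X(S)^T X(S))^{-1} b ||_inf for index list S."""
    p = len(X[0])
    Sc = [j for j in range(p) if j not in S]
    gram = [[sum(row[j] * row[k] for row in X) for k in S] for j in S]
    u = solve(gram, b)                                               # (X(S)^T X(S))^{-1} b
    fitted = [sum(row[j] * uj for j, uj in zip(S, u)) for row in X]  # X(S) u
    return max(abs(sum(row[j] * f for row, f in zip(X, fitted))) for j in Sc)

r2 = 2 ** 0.5
# Toy design with n = 4; every column has l2-norm sqrt(n) = 2.
# Columns 0 and 1 are relevant, column 2 is orthogonal to both,
# column 3 equals (column 0 + column 1) / sqrt(2).
X = [
    [1.0,  1.0,  1.0, r2],
    [1.0, -1.0,  1.0, 0.0],
    [1.0,  1.0, -1.0, r2],
    [1.0, -1.0, -1.0, 0.0],
]
same_signs = irrepresentable_value(X, [0, 1], [1.0, 1.0])       # about sqrt(2): (11) holds, (5) fails
opposite_signs = irrepresentable_value(X, [0, 1], [1.0, -1.0])  # about 0: (5) holds
```

The example also shows that the condition depends on the sign vector $b$ and not only on the design: the same correlated column is harmless when the two relevant coefficients have opposite signs.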

Statement (a) would hold for the homoscedastic model by removing $\mathrm{diag}(|X\beta^*|)$ from the denominator in Equation (10). Equation (10) can be viewed as a comparison of the signal strength ($\beta^*_j$) to the noise level ($\mathrm{var}(X_j^T \epsilon)$). Theorem 2 shows that the signal strength needs to be large relative to the noise level. Statement (b) says that the Irrepresentable Condition (5) is necessary for the Lasso's sign consistency. This necessary condition can also be found in both Zhao and Yu (2006) and Wainwright (2009). Zhao and Yu (2006) point out that the Irrepresentable Condition is almost necessary and sufficient for the Lasso to be sign consistent under the standard homoscedastic model when $p$ and $q$ are fixed. Wainwright (2009) shows that it is necessary for the Lasso's sign consistency under the standard model for any $p$ and $q$.

The results in this section have addressed the case when $X$ is fixed. The essential results say that when SNR is large, the Lasso performs well at estimating $\mathrm{sign}(\beta^*)$. In the next section $X$ is random. The randomness of $X$ allows us to study how the statistical dependence between $X$ and the noise terms affects the sign consistency of the Lasso. When SNR is large, the Lasso is robust to this violation of independence.

3 Gaussian Random Design

Consider the Gaussian random design where rows of $X$ are i.i.d. from a $p$-dimensional multivariate Gaussian distribution with mean 0 and variance-covariance matrix $\Sigma$. Assume the diagonal entries of $\Sigma$ are all equal to one. Define the variance-covariance matrix of the relevant predictors to be $\Sigma_{11}$ and the covariance between the irrelevant predictors and the relevant predictors to be $\Sigma_{21}$. Specifically,

$$\Sigma_{11} = E\left(\frac{1}{n} X(S)^T X(S)\right) \quad \text{and} \quad \Sigma_{21} = E\left(\frac{1}{n} X(S^c)^T X(S)\right).$$

Let $\Lambda_{\min}(\cdot)$ denote the minimum eigenvalue of a matrix and $\Lambda_{\max}(\cdot)$ denote the maximum eigenvalue of a matrix. To get the main results that allow $p$ to grow with $n$, the following regularity conditions are needed on the $p \times p$ matrix $\Sigma$.

First, for some positive constants $C_{\min}$ and $C_{\max}$ that do not depend on $n$,

$$\Lambda_{\min}(\Sigma_{11}) \ge C_{\min} > 0 \quad \text{and} \quad \Lambda_{\max}(\Sigma) \le C_{\max} < \infty, \tag{12}$$

and second, the Irrepresentable Condition,

$$\left\| \Sigma_{21} (\Sigma_{11})^{-1} b \right\|_\infty \le 1 - \eta, \tag{13}$$

for some constant $\eta \in (0, 1]$. Assumptions (12) and (13) are standard assumptions in the previous research under standard models.

Define

$$V(n, \beta^*, \lambda, \sigma^2) = \frac{2 \lambda^2 q}{n\, C_{\min}} + \frac{3 \sigma^2 \sqrt{C_{\max}}\, \|\beta^*\|_2}{n},$$

$$A(n, \beta^*, \sigma^2) = \sqrt{ \frac{4 \sigma^2 \|\beta^*\|_2 \log n \sqrt{\max(16q, 4 \log n)}}{n\, C_{\min}} },$$

and

$$\Psi(n, \beta^*, \lambda, \sigma^2) = A(n, \beta^*, \sigma^2) + \frac{2 \lambda \sqrt{q}}{C_{\min}}.$$

These quantities are used in the following theorem.

Theorem 3. Consider the sparse Poisson-like model described by (3), under Gaussian random design. Suppose that the variance-covariance matrix $\Sigma$ satisfies condition (12) and condition (13) with unit diagonal. Further, suppose that $q/n \to 0$. Then for any $\lambda$ such that $M(\beta^*) > \Psi(n, \beta^*, \lambda, \sigma^2)$, $\hat\beta(\lambda) =_s \beta^*$ with probability greater than

$$1 - 2 \exp\left\{ -\frac{\lambda^2 \eta^2}{2 V(n, \beta^*, \lambda, \sigma^2)\, C_{\max}} + \log(p - q) \right\} - (q + 3) \exp\{-0.03 n\} - \frac{1 + 3q}{n}.$$

A proof of Theorem 3 can be found in the Appendix.

Theorem 3 gives a non-asymptotic result on the Lasso's sparsity pattern recovery property, from which the next corollary can be derived. It specifies a sequence of $\lambda$'s that asymptotically recovers the true sparsity pattern on a well behaved class of models. This class of models restricts the relationship between the data ($X$), the coefficients ($\beta^*$), and the distribution of the noise ($\epsilon$). Basically speaking, $\lambda$ should be chosen such that

(1) $\dfrac{\lambda^2 \eta^2}{V(n, \beta^*, \lambda, \sigma^2)\, C_{\max} \log(p - q)} \to \infty$, and

(2) $M(\beta^*) > \Psi(n, \beta^*, \lambda, \sigma^2)$.

Define

$$\Gamma(n, \beta^*, \sigma^2) = \frac{n \eta^2}{ 4 q \log(p - q + 1)\, \dfrac{C_{\max}}{C_{\min}} + \dfrac{96\, \sigma^2 q\, \|\beta^*\|_2 \log(p - q + 1)\, C_{\max}^{3/2}}{\left( M(\beta^*) - A(n, \beta^*, \sigma^2) \right)^2 C_{\min}^2} }.$$

Corollary 3. As in Theorem 3, consider the sparse Poisson-like model described by (3), under Gaussian random design. Suppose the variance-covariance matrix $\Sigma$ satisfies condition (12) and condition (13) with unit diagonal. Further, suppose that $M(\beta^*) > A(n, \beta^*, \sigma^2)$ and $q/n \to 0$. Take $\lambda$ such that

$$\lambda = \frac{\left( M(\beta^*) - A(n, \beta^*, \sigma^2) \right) C_{\min}}{4 \sqrt{q}},$$

then $\hat\beta(\lambda) =_s \beta^*$ with probability greater than

$$1 - 2 \exp\left\{ -\log(p - q + 1) \left[ \Gamma(n, \beta^*, \sigma^2) - 1 \right] \right\} - (q + 3) \exp\{-0.03 n\} - \frac{1 + 3q}{n}.$$

If

$$\frac{n}{q \log(p - q + 1)} \to \infty \quad \text{and} \quad \frac{n \left( M(\beta^*) - A(n, \beta^*, \sigma^2) \right)^2}{\sigma^2 \|\beta^*\|_2\, q \log(p - q + 1)} \to \infty, \tag{14}$$

then $P\left[\hat\beta(\lambda) =_s \beta^*\right]$ converges to one.

A proof of Corollary 3 can be found in the Appendix.

This corollary gives a class of heteroscedastic models for which the Lasso gives a sign consistent estimate of $\beta^*$ when the predictors are from a Gaussian random ensemble. The sufficient conditions require that $M(\beta^*) > A(n, \beta^*, \sigma^2)$, which is equivalent to

$$\mathrm{SNR} > \frac{8}{C_{\min}} \log n \sqrt{\max(4q, \log n)}. \tag{15}$$

The sufficient conditions also require the conditions in (14), which imply that

$$\mathrm{SNR} = \Omega\left( q \log(p - q + 1) \right). \tag{16}$$

These conditions show that when SNR is large, the Lasso can identify the sign of the true predictors. This result is similar to the result for the fixed design case in the previous section and it is similar to results on the standard model.

The next theorem gives necessary conditions for the Lasso to be sign consistent. It says that the Irrepresentable Condition is necessary for the sign consistency of the Lasso under the sparse Poisson-like model. This condition is also necessary under the homoscedastic model.

Theorem 4 (Necessary Conditions). Consider the sparse Poisson-like model described by (3), under Gaussian random design. Suppose the variance-covariance matrix $\Sigma$ satisfies condition (12). If the Irrepresentable Condition (13) does not hold, specifically,

$$\left\| \Sigma_{21} (\Sigma_{11})^{-1} b \right\|_\infty \ge 1, \tag{17}$$

then the Lasso estimate is not sign consistent: $P\left[\hat\beta(\lambda) =_s \beta^*\right] \le \frac{1}{2}$.

A proof of Theorem 4 can be found in the Appendix.

This section identified sufficient conditions for the Lasso to be sign consistent when the design matrix is random and Gaussian. The sufficient conditions are similar for both fixed and random design matrices. They are also similar for both homoscedastic noise and Poisson-like noise. In all cases, the conditions require that a measure of the signal to noise ratio is large; see Equations (8) and (16) and the inequality in (15). In the next section, simulations are used to directly compare the performance of the Lasso between the Poisson-like model and the standard homoscedastic model.

4 Simulation Studies

There are two examples in this section. The first example investigates a peculiarity of the SNR defined in (4): functions of $\beta^*$ appear in both the signal and the noise. The second example compares the model selection performance of the Lasso under the standard model to the model selection performance of the Lasso under the sparse Poisson-like model. In the first example, all data are generated from the sparse Poisson-like model. In the second example, the performance of the Lasso is compared between the two models of the noise. The parameterizations of the standard homoscedastic models differ in only one respect: the noise terms are homoscedastic. To ensure a fair comparison, the variance of the noise terms in the standard model is set equal to the average variance of the noise terms in the corresponding Poisson-like model.
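The variance matching described above can be sketched as follows. This is one reading of "average variance", namely the mean of $\sigma^2 |x_i^T \beta^*|$ over the $n$ observations for the fixed design; the function names and the toy design are illustrative assumptions.

```python
def poisson_like_variances(X, beta, sigma2):
    """Per-observation noise variances sigma2 * |x_i^T beta|."""
    return [sigma2 * abs(sum(x * b for x, b in zip(row, beta))) for row in X]

def matched_homoscedastic_variance(X, beta, sigma2):
    """Single variance for the standard model: the mean of the Poisson-like variances."""
    v = poisson_like_variances(X, beta, sigma2)
    return sum(v) / len(v)

# Toy fixed design and coefficients, chosen only so the arithmetic is easy to follow.
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
beta = [3.0, -1.0]
per_obs = poisson_like_variances(X, beta, sigma2=1.0)          # [3.0, 2.0, 2.0]
common = matched_homoscedastic_variance(X, beta, sigma2=1.0)   # 7/3
```

Under this matching the two noise vectors have the same total variance, so any difference in sign recovery is attributable to how the variance is distributed across observations and not to its overall level.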

All simulations were done in R with the LARS package (Efron et al., 2004).

Example 1. This example studies how the Lasso is sensitive to the ratio of signal to noise defined earlier. Recall,

$$\mathrm{SNR} = \frac{n\,[M(\beta^*)]^2}{\sigma^2 \|\beta^*\|_2}. \tag{18}$$

The theoretical results in the previous sections suggest that when SNR is large, the Lasso is sign consistent. In SNR, $\beta^*$ can affect the ratio in two ways. As $\|\beta^*\|_2$ grows, so does the variance of the noise term. As $M(\beta^*)$ shrinks, so does the signal. This first example investigates both effects, changing the value of $\|\beta^*\|_2$ while keeping $M(\beta^*)$ fixed, and vice versa.

Consider an initial model with the parameters $n = 400$, $p = 1000$, $q = 20$, $\sigma^2 = 1$, and each element of the design matrix $X$ drawn independently from $N(0, 1)$. Once $X$ is drawn, it is fixed through all of the simulations. This $X$ is also used in Example 2. $\beta^*$ is designed this way:

$$\beta^*_j = \begin{cases} \beta_{\max} & \text{if } j \le 10 \\ \beta_{\min} & \text{if } 11 \le j \le 20 \\ 0 & \text{otherwise.} \end{cases}$$

Changing $\beta_{\min}$ or $\|\beta^*\|_2$ changes the value of SNR. To investigate the role of these two quantities, there are two simulation designs. The first simulation design fixes $M(\beta^*) = \beta_{\min} = 5$ and changes the value of $\beta_{\max}$. The second simulation design fixes $\|\beta^*\|_2$ and changes the value of $M(\beta^*)$. There is one model that is present in both designs. It sets $\beta_{\max} = 40$ and $\beta_{\min} = 5$. In this model that is common to both designs, $\|\beta^*\|_2 \approx 127$ and $\mathrm{SNR} \approx 78$.

The first simulation design has ten different parameterizations. It sets $\beta_{\min} = 5$ and chooses

$$\beta_{\max} \in \{100, 90, 80, 70, 60, 50, 40, 30, 20, 10\}.$$

Each of these ten different parameterizations creates a different value of SNR. The second simulation design has ten different parameterizations, each fixing $\|\beta^*\|_2$ and altering $\beta_{\min}$ such that SNR does not change from the first simulation design (to keep $\|\beta^*\|_2$ fixed, $\beta_{\max}$ must change accordingly). The values of the parameters for the two designs are described in the following two tables.
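The parameter values of both designs follow directly from the stated construction, so they can be recomputed with a short script. The sketch below assumes $\mathrm{SNR} = n [M(\beta^*)]^2 / (\sigma^2 \|\beta^*\|_2)$, $n = 400$, $\sigma^2 = 1$, and two groups of ten nonzero coefficients (ten equal to $\beta_{\max}$, ten equal to $\beta_{\min}$); the variable and function names are illustrative.

```python
import math

n, sigma2, group = 400, 1.0, 10   # assumed: ten coefficients at beta_max, ten at beta_min

def l2_norm(beta_max, beta_min):
    """||beta*||_2 for ten entries equal to beta_max and ten equal to beta_min."""
    return math.sqrt(group * (beta_max ** 2 + beta_min ** 2))

def snr(beta_max, beta_min):
    """SNR = n * M(beta*)^2 / (sigma^2 * ||beta*||_2), with M(beta*) = beta_min."""
    return n * beta_min ** 2 / (sigma2 * l2_norm(beta_max, beta_min))

# Design one: beta_min fixed at 5, beta_max from 100 down to 10.
design_one = [(float(bmax), 5.0, snr(bmax, 5.0)) for bmax in range(100, 0, -10)]

# Design two: ||beta*||_2 fixed at its value in the common model (beta_max = 40,
# beta_min = 5); beta_min and beta_max are solved so that SNR matches design one.
norm_fixed = l2_norm(40.0, 5.0)
design_two = []
for _, _, target in design_one:
    bmin = math.sqrt(target * sigma2 * norm_fixed / n)
    bmax = math.sqrt(norm_fixed ** 2 / group - bmin ** 2)
    design_two.append((bmax, bmin, snr(bmax, bmin)))
```

For the common model the script gives $\|\beta^*\|_2 \approx 127.5$ and $\mathrm{SNR} \approx 78.4$, in line with the rounded values quoted in the text; the remaining entries are computed values under the stated assumptions, not figures copied from the paper.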

Table 1: The design of the first simulation is described in this table. It shows the relationship between $\beta_{\max}$, $\|\beta^*\|_2$, and SNR. $M(\beta^*) = 5$ is fixed. When $\beta_{\max}$ shrinks, $\|\beta^*\|_2$ also shrinks, increasing SNR. The numbers in the table are rounded to the nearest integer.

[Table 1 columns: $\beta_{\max}$ | $\|\beta^*\|_2$ | SNR; entries omitted]

Table 2: The design of the second simulation is described in this table. It shows the relationship between $M(\beta^*)$, $\beta_{\max}$, and SNR. $\beta_{\min}$ and $\beta_{\max}$ are chosen such that $\|\beta^*\|_2 = 127$ is fixed and SNR keeps the same values as in simulation design one. Thus, $\beta_{\min}$ and $\beta_{\max}$ are determined by SNR and $\|\beta^*\|_2$: $\beta_{\min} = \sqrt{\mathrm{SNR}\, \|\beta^*\|_2 / n}$ and $\beta_{\max} = \sqrt{\|\beta^*\|_2^2 / 10 - \beta_{\min}^2}$. The numbers in the table are rounded.

[Table 2 columns: $\beta_{\min} = M(\beta^*)$ | $\beta_{\max}$ | SNR; entries omitted]

For each simulation design, the Monte Carlo estimate for the probability of correctly estimating the signs is plotted against SNR in Figure 1. Each point along the solid line in Figure 1 corresponds to simulation design one ($\beta_{\min}$ is fixed), and each point along the dashed line corresponds to simulation design two ($\|\beta^*\|_2$ is fixed). Success is defined as the existence of a $\lambda$ that makes $\hat\beta(\lambda) =_s \beta^*$. The probability of success for each point is estimated with 500 trials.

Figure 1 shows that as SNR increases, the probability of success also increases. What is especially remarkable is the similarity between the solid and dashed lines. This simulation demonstrates that increasing the elements of $\beta^*$ can both increase and decrease the probability of successfully estimating the signs. Further, this simulation demonstrates that these effects are well characterized by SNR.

[Figure 1 plot. Title: Probability of success increases as SNR increases. Horizontal axis: SNR. Vertical axis: Probability of estimating correct signs. Legend: Simulation design one; Simulation design two.]

Figure 1: Probability of Success vs. SNR in Example 1. For the solid line, $\|\beta^*\|_2$ decreases while $M(\beta^*)$ is kept constant. For the dashed line, $M(\beta^*)$ increases while $\|\beta^*\|_2$ is kept constant. The values of $M(\beta^*)$ and $\|\beta^*\|_2$ are chosen so that SNR as defined in (18) takes the values specified on the horizontal axis. Each probability is estimated with 500 simulations.

Example 2 (Comparison to Standard Model). This example compares the performance of the Lasso applied to homoscedastic data to the performance of the Lasso applied to Poisson-like data. The design matrix $X$ is exactly the same (fixed) matrix as in Example 1 and $\beta^*$ follows the constructions specified in Tables 1 and 2. The only difference between Example 1 and Example 2 is that the noise terms are homoscedastic in Example 2. To ensure a fair comparison, the variance of the noise is always set equal to the average variance of the noise terms in the corresponding Poisson-like model.

There are four lines drawn in Figure 2. The solid lines correspond to simulation design one ($\|\beta^*\|_2$ grows while $M(\beta^*)$ is held constant). The dashed lines correspond to simulation design two ($M(\beta^*)$ shrinks while $\|\beta^*\|_2$ is held constant). The bold lines correspond to the simulations on homoscedastic data.

19 THE LASSO UNDER HETEROSCEDASTICITY 9 The thinner lines are identical to the lines in Figure. They are included to compare the performance of the Lasso on homoscedastic data to the performance of the Lasso on Poisson-like data. Exactly as in Example, success is defined as the existence of a λ which makes ˆβ(λ) = s β. The probability of success for each point is estimated with 500 trials. Probability of estimating correct signs Performance on homoscedastic data is similar to performance on Poisson like data Simulation design one, homoscedastic noise Simulation design one, Poisson like noise Simulation design two, homoscedastic noise Simulation design two, Poisson like noise SNR Figure : The bold solid line nearly covers the thinner solid line. This demonstrates how similar the results are for the homoscedastic data and the Poisson-like data. The same statement holds for the dashed lines. In the Poisson-like model, the variance of the noise term is dependent on the predicting variables. In the standard model, the noise term is homoscedastic, independent of the predicting variables. Yet, Figure demonstrates how similar the performance of the Lasso is under these two different models of the noise. In the figure, the thinner lines are nearly indistinguishable from the bold lines,

showing that the Lasso is robust to one type of heteroscedastic noise.

5 Conclusion

A significant amount of research is dedicated to understanding how the Lasso (and similar methods) performs under the standard homoscedastic model. In practice, however, data do not necessarily have the features described by the standard model. This paper aims to understand whether the sign consistency of the Lasso is robust to the heteroscedastic errors in the Poisson-like model. This model is motivated by certain problems in high-dimensional medical imaging (Fessler, 2000). The key feature of the model is that, for each observation, the variance of the noise term is proportional to the absolute value of the expectation of the observation ($\mathrm{var}(\epsilon_i) \propto |E(Y_i)|$). In the Poisson-like model, as in all heteroscedastic models, the noise term is not independent of the design matrix.

In this paper, we analyzed the sign consistency of the standard Lasso when the data come from the sparse Poisson-like model, showing that the Lasso is robust to one type of violation of the homoscedasticity assumption. Theorems 1 and 3 give non-asymptotic results for the Lasso's sign consistency under the sparse Poisson-like model, for deterministic design matrices and for random Gaussian ensemble design matrices respectively. Building on these non-asymptotic results, we chose a suitable $\lambda$ for which, under sufficient conditions, the Lasso is sign consistent. We also studied how sensitive the Lasso is to the heteroscedastic model by finding necessary conditions for sign consistency. The theoretical results for the sparse Poisson-like model are similar to the results for the standard model when the SNRs are matched. In both models, for the Lasso to be sign consistent, it is essential that a measure of the signal to noise ratio is large.
The simulations demonstrated what our theory predicted: the Lasso performs similarly, in terms of model selection, on Poisson-like data and on homoscedastic data when the variance of the noise is scaled appropriately. These simulations covered multiple choices of $\beta^*$ and multiple choices of the noise level. Taken as a whole, these results suggest that the Lasso is robust to Poisson-like heteroscedastic noise.

Acknowledgment

This work is inspired by a personal communication between Bin Yu and Professor Peng Zhang from Capital Normal University in Beijing. We would like to thank Professor Ming Jiang and Vincent Vu for their helpful comments and suggestions on this paper. Jinzhu Jia and Bin Yu are partially supported by NSF grants DMS and SES. Karl Rohe is partially supported by an NSF VIGRE Graduate Fellowship. Bin Yu is also partially supported by a grant from MSRA.

Appendix

1 Proofs

1.1 Proof of Theorem 1

To prove the theorem, we need the next lemma, which gives necessary and sufficient conditions for the Lasso's sign consistency. These conditions are important to the asymptotic analysis. Wainwright (2009) gives these conditions, which follow from the KKT conditions.

Lemma 1. For the linear model $Y = X\beta^* + \epsilon$, assume that the matrix $X(S)^T X(S)$ is invertible. Then for any given $\lambda > 0$ and any noise term $\epsilon \in \mathbb{R}^n$, there exists a Lasso estimate $\hat\beta(\lambda)$ which satisfies $\hat\beta(\lambda) =_s \beta^*$ if and only if the following two conditions hold:

\[
\left| X(S^c)^T X(S)\left(X(S)^T X(S)\right)^{-1}\left[\tfrac{1}{n} X(S)^T \epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right] - \tfrac{1}{n} X(S^c)^T \epsilon \right| \le \lambda, \tag{19}
\]

\[
\mathrm{sign}\left(\beta^*(S) + \left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1}\left[\tfrac{1}{n} X(S)^T \epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right]\right) = \mathrm{sign}(\beta^*(S)), \tag{20}
\]

where the vector inequality and equality are taken elementwise. Moreover, if (19) holds strictly, then $\hat\beta = (\hat\beta^{(1)}, 0)$ is the unique optimal solution to the Lasso problem (), where

\[
\hat\beta^{(1)} = \beta^*(S) + \left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1}\left[\tfrac{1}{n} X(S)^T \epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right]. \tag{21}
\]
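The two conditions of the lemma are straightforward to evaluate numerically for a given design, coefficient vector, noise draw, and $\lambda$. The sketch below is a hedged illustration: the function name `lemma1_conditions` and its return convention are assumptions made here, and it presumes that $X(S)^T X(S)$ is invertible, as the lemma requires.

```python
import numpy as np

def lemma1_conditions(X, beta, eps, lam):
    """Evaluate conditions (19) and (20) for one noise draw.

    Returns (cond19, cond20, beta_S_hat), where beta_S_hat is the candidate
    estimate on the support given by the right-hand side of (21).
    """
    n = X.shape[0]
    S = np.flatnonzero(beta)                  # support of beta
    Sc = np.flatnonzero(beta == 0)            # complement of the support
    XS, XSc = X[:, S], X[:, Sc]
    b = np.sign(beta[S])
    G = XS.T @ XS / n                         # (1/n) X(S)^T X(S)
    # w = G^{-1} [ (1/n) X(S)^T eps - lam * sign(beta(S)) ]
    w = np.linalg.solve(G, XS.T @ eps / n - lam * b)
    # (X(S)^T X(S))^{-1} = G^{-1} / n, hence the extra 1/n on the first term
    lhs19 = XSc.T @ XS @ w / n - XSc.T @ eps / n
    cond19 = bool(np.all(np.abs(lhs19) <= lam))
    beta_S_hat = beta[S] + w
    cond20 = bool(np.array_equal(np.sign(beta_S_hat), b))
    return cond19, cond20, beta_S_hat
```

With an orthonormal design scaled so that $\frac{1}{n}X^TX = I$ and no noise, the candidate estimate is simply $\beta^*(S) - \lambda\,\mathrm{sign}(\beta^*(S))$, so condition (20) fails as soon as $\lambda$ exceeds the smallest nonzero $|\beta^*_j|$.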

As in Wainwright (2009), we state sufficient conditions for (19) and (20). Define $b = \mathrm{sign}(\beta^*(S))$, and denote by $e_i$ the vector with 1 in the $i$th position and zeroes elsewhere. Define

\[
U_i = e_i^T \left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1}\left[\tfrac{1}{n} X(S)^T \epsilon - \lambda b\right],
\]

\[
V_j = X_j^T \left\{ X(S)\left(X(S)^T X(S)\right)^{-1} \lambda b - \left[X(S)\left(X(S)^T X(S)\right)^{-1} X(S)^T - I\right]\frac{\epsilon}{n} \right\}.
\]

By rearranging terms, it is easy to see that (19) holds strictly if and only if

\[
M(V) = \left\{ \max_{j \in S^c} |V_j| < \lambda \right\} \tag{22}
\]

holds. If we define $M(\beta^*) = \min_{j \in S} |\beta^*_j|$ (recall that $S = \{j : \beta^*_j \neq 0\}$ is the sparsity index), then the event

\[
M(U) = \left\{ \max_{i \in S} |U_i| < M(\beta^*) \right\} \tag{23}
\]

is sufficient to guarantee that condition (20) holds. Finally, a proof of Theorem 1.

Proof. This proof is divided into two parts. First we analyze the asymptotic probability of the event $M(V)$, and then we analyze the event $M(U)$.

Analysis of $M(V)$: Note from (22) that $M(V)$ holds if and only if $\max_{j \in S^c} |V_j| / \lambda < 1$. Each random variable $V_j$ is Gaussian with mean

\[
\mu_j = \lambda X_j^T X(S)\left(X(S)^T X(S)\right)^{-1} b.
\]

Define $\tilde V_j = X_j^T \left[I - X(S)\left(X(S)^T X(S)\right)^{-1} X(S)^T\right]\frac{\epsilon}{n}$; then $V_j = \mu_j + \tilde V_j$. Using condition (5), we have $|\mu_j| \le (1-\eta)\lambda$ for all $j \in S^c$, from which we obtain that

\[
\frac{1}{\lambda}\max_{j \in S^c} |\tilde V_j| < \eta \;\Longrightarrow\; \frac{\max_{j \in S^c} |V_j|}{\lambda} < 1.
\]

By the Gaussian comparison result (34) stated in Lemma 5, we have

\[
P\left[\frac{1}{\lambda}\max_{j \in S^c} |\tilde V_j| \ge \eta\right] \le 2(p-q)\exp\left\{-\frac{\lambda^2\eta^2}{2\max_{j \in S^c} E(\tilde V_j^2)}\right\}.
\]

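Since

\[
E(\tilde V_j^2) = \frac{1}{n^2} X_j^T H\,\mathrm{VAR}(\epsilon)\,H X_j,
\]

where $H = I - X(S)\left(X(S)^T X(S)\right)^{-1} X(S)^T$ has maximum eigenvalue equal to 1, and $\mathrm{VAR}(\epsilon)$ is the variance-covariance matrix of $\epsilon$, which is a diagonal matrix with the $i$th diagonal element equal to $\sigma^2 |x_i^T \beta^*|$. Since $|x_i^T \beta^*| \le \|x_i(S)\|_2 \|\beta^*\|_2 \le \max_i \|x_i(S)\|_2 \|\beta^*\|_2$, an operator bound yields

\[
E(\tilde V_j^2) \le \frac{\sigma^2}{n^2}\max_i \|x_i(S)\|_2 \|\beta^*\|_2 \|X_j\|_2^2 = \frac{\sigma^2}{n}\max_i \|x_i(S)\|_2 \|\beta^*\|_2,
\]

where the equality uses $\|X_j\|_2^2 = n$. Therefore,

\[
P\left[\frac{1}{\lambda}\max_j |\tilde V_j| \ge \eta\right] \le 2(p-q)\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}\right\}.
\]

So we have

\[
P\left[\frac{1}{\lambda}\max_j |V_j| < 1\right] \ge 1 - P\left[\frac{1}{\lambda}\max_j |\tilde V_j| \ge \eta\right] \ge 1 - 2(p-q)\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}\right\}.
\]

Analysis of $M(U)$:

\[
\max_i |U_i| \le \left\|\left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1}\tfrac{1}{n} X(S)^T \epsilon\right\|_\infty + \lambda\left\|\left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1} b\right\|_\infty.
\]

Define $Z_i := e_i^T \left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1}\tfrac{1}{n} X(S)^T \epsilon$. Each $Z_i$ is Gaussian with mean 0 and variance

\[
\mathrm{var}(Z_i) = e_i^T \left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1}\tfrac{1}{n} X(S)^T\,\mathrm{VAR}(\epsilon)\,\tfrac{1}{n} X(S)\left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1} e_i \le \frac{\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}{n C_{\min}}.
\]

So, for any $t > 0$, by (34),

\[
P\left(\max_{i \in S} |Z_i| \ge t\right) \le 2q\exp\left\{-\frac{t^2 n C_{\min}}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}\right\}.
\]

By taking $t = \lambda\eta / \sqrt{C_{\min}}$, we have

\[
P\left(\max_{i \in S} |Z_i| \ge \frac{\lambda\eta}{\sqrt{C_{\min}}}\right) \le 2q\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}\right\}.
\]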
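The variance bound used in the analysis of $M(U)$ can be checked numerically on a concrete design. The following is a small sketch under assumptions made here: the function name `z_variance_and_bound` is hypothetical, and $C_{\min}$ is taken to be the smallest eigenvalue of $\frac{1}{n}X(S)^TX(S)$ itself, which is the largest value for which an eigenvalue lower bound can hold.

```python
import numpy as np

def z_variance_and_bound(X, beta, sigma):
    """Exact var(Z_i) for i in S under Poisson-like noise, and the uniform bound
    sigma^2 * ||beta||_2 * max_i ||x_i(S)||_2 / (n * C_min)."""
    n = X.shape[0]
    S = np.flatnonzero(beta)
    XS = X[:, S]
    G = XS.T @ XS / n                          # (1/n) X(S)^T X(S)
    var_eps = sigma**2 * np.abs(X @ beta)      # diagonal of VAR(eps)
    A = np.linalg.solve(G, XS.T) / n           # row i is e_i^T G^{-1} (1/n) X(S)^T
    exact = (A**2 * var_eps).sum(axis=1)       # var(Z_i) = sum_k A[i, k]^2 * var(eps_k)
    c_min = np.linalg.eigvalsh(G).min()        # assumption: C_min = smallest eigenvalue of G
    max_row_norm = np.linalg.norm(XS, axis=1).max()
    bound = sigma**2 * np.linalg.norm(beta) * max_row_norm / (n * c_min)
    return exact, bound
```

On random Gaussian designs the exact variances are typically well below the bound, since the bound replaces every $|x_i^T\beta^*|$ by its worst case.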

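Recall the definition

\[
\Psi(X, \beta^*, \lambda) = \lambda\left[\eta\,(C_{\min})^{-1/2} + \left\|\left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1} b\right\|_\infty\right];
\]

we have

\[
P\left(\max_i |U_i| \ge \Psi(X, \beta^*, \lambda)\right) \le 2q\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}\right\}.
\]

By the condition $M(\beta^*) > \Psi(X, \beta^*, \lambda)$, we have

\[
P\left(\max_i |U_i| < M(\beta^*)\right) \ge 1 - 2q\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}\right\}.
\]

At last, we have

\[
P\left[M(V)\ \&\ M(U)\right] \ge 1 - 2p\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2}\right\}.
\]

1.2 Proof of Corollary 1

Proof. Recall the definition of $\Gamma(X, \beta^*, \sigma^2)$:

\[
\Gamma(X, \beta^*, \sigma^2) = \frac{\eta^2\,\mathrm{SNR}}{8\max_i \|x_i(S)\|_2 \left(\eta\,C_{\min}^{-1/2} + \sqrt{q}\,C_{\min}^{-1}\right)^2 \log(p+1)},
\]

where $\mathrm{SNR} = \dfrac{n\,M(\beta^*)^2}{\sigma^2 \|\beta^*\|_2}$. So,

\[
\frac{n\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2} = \frac{4\,\Gamma(X, \beta^*, \sigma^2)\left(\eta\,C_{\min}^{-1/2} + \sqrt{q}\,C_{\min}^{-1}\right)^2 \log(p+1)}{M(\beta^*)^2}.
\]

By taking

\[
\lambda = \frac{M(\beta^*)}{2\left(\eta\,C_{\min}^{-1/2} + \sqrt{q}\,C_{\min}^{-1}\right)},
\]

we have

\[
\Psi(X, \beta^*, \lambda) = \lambda\left[\eta\,(C_{\min})^{-1/2} + \left\|\left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1} b\right\|_\infty\right] \le \lambda\left[\eta\,C_{\min}^{-1/2} + \sqrt{q}\,C_{\min}^{-1}\right] = \frac{M(\beta^*)}{2} < M(\beta^*),
\]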

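and

\[
\frac{n\lambda^2\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2} = \Gamma(X, \beta^*, \sigma^2)\log(p+1).
\]

So, the probability bound in Theorem 1 is greater than

\[
1 - 2\exp\left\{-\left(\Gamma(X, \beta^*, \sigma^2) - 1\right)\log(p+1)\right\},
\]

which goes to one when $\Gamma(X, \beta^*, \sigma^2, \alpha) \to \infty$.

1.3 Proof of Theorem 2

Proof. First we prove (b). Without loss of generality, assume that for some $j \in S^c$ and some $\zeta \ge 0$,

\[
X_j^T X(S)\left(X(S)^T X(S)\right)^{-1} b = 1 + \zeta;
\]

then $V_j = \lambda(1+\zeta) + \tilde V_j$, where $\tilde V_j = X_j^T\left[I - X(S)\left(X(S)^T X(S)\right)^{-1} X(S)^T\right]\frac{\epsilon}{n}$ is a Gaussian random variable with mean 0, so $P(\tilde V_j > 0) = \frac{1}{2}$. So $P(V_j > \lambda) \ge \frac{1}{2}$, which implies that for any $\lambda$, Condition (19) (a necessary condition) is violated with probability at least $1/2$.

For claim (a), Condition (20),

\[
\mathrm{sign}\left(\beta^*(S) + \left(\tfrac{1}{n} X(S)^T X(S)\right)^{-1}\left[\tfrac{1}{n} X(S)^T \epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right]\right) = \mathrm{sign}(\beta^*(S)),
\]

is also a necessary condition for sign consistency. Since $\tfrac{1}{n} X(S)^T X(S) = I_{q \times q}$, (20) becomes

\[
\mathrm{sign}\left(\beta^*(S) + \tfrac{1}{n} X(S)^T \epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right) = \mathrm{sign}(\beta^*(S)),
\]

which implies that

\[
\mathrm{sign}\left(\beta^*(S) + \tfrac{1}{n} X(S)^T \epsilon\right) = \mathrm{sign}(\beta^*(S)). \tag{24}
\]

Without loss of generality, assume that for some $j \in S$, $\beta^*_j > 0$. Then (24) implies $\beta^*_j + Z_j > 0$, where $Z_j = e_j^T \tfrac{1}{n} X(S)^T \epsilon$ is a Gaussian random variable with mean 0 and variance

\[
\mathrm{var}(Z_j) = e_j^T \tfrac{1}{n} X(S)^T\,\mathrm{VAR}(\epsilon)\,\tfrac{1}{n} X(S) e_j = \frac{\sigma^2}{n^2} e_j^T X(S)^T \mathrm{diag}(|X\beta^*|) X(S) e_j = \frac{(\beta^*_j)^2}{c_{n,j}^2},
\]

where the last equality uses the definition of $c_{n,j}$ in Theorem 2. To summarize,

\[
\begin{aligned}
P\left[\hat\beta(\lambda) =_s \beta^*\right] &\le P\left[\beta^*_j + Z_j > 0\right] = P\left[Z_j > -\beta^*_j\right] = P\left[Z_j < \beta^*_j\right] \\
&= 1 - \int_{\beta^*_j}^{\infty} \frac{1}{\sqrt{2\pi\,\mathrm{var}(Z_j)}}\exp\left\{-\frac{x^2}{2\,\mathrm{var}(Z_j)}\right\} dx \\
&= 1 - \int_{\beta^*_j/\sqrt{\mathrm{var}(Z_j)}}^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{x^2}{2}\right\} dx \\
&\le 1 - \int_{\beta^*_j/\sqrt{\mathrm{var}(Z_j)}}^{\infty} \frac{1}{\sqrt{2\pi}}\left(\frac{x}{1+x} + \frac{1}{(1+x)^2}\right)\exp\left\{-\frac{x^2}{2}\right\} dx \\
&= 1 - \frac{\exp\left\{-\dfrac{(\beta^*_j)^2}{2\,\mathrm{var}(Z_j)}\right\}}{\sqrt{2\pi}\left(1 + \dfrac{\beta^*_j}{\sqrt{\mathrm{var}(Z_j)}}\right)} \\
&= 1 - \frac{\exp\left\{-c_{n,j}^2/2\right\}}{\sqrt{2\pi}\,(1 + c_{n,j})}.
\end{aligned}
\]

The inequality holds because $\frac{x}{1+x} + \frac{1}{(1+x)^2} \le 1$ for $x \ge 0$, and the integral that follows is evaluated using the antiderivative $-\exp\{-x^2/2\}/(1+x)$.

1.4 Proofs of Theorem 3

To prove Theorem 3, we need some preliminary results.

Lemma 2. Conditioned on $X(S)$ and $\epsilon$, the random vector $V$ is Gaussian. Its mean vector is upper bounded as

\[
\left|E\left[V \mid \epsilon, X(S)\right]\right| \le \lambda(1-\eta). \tag{25}
\]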

27 THE LASSO UNDER HETEROSCEDASTICITY 7 Moreover, its conditional covariance takes the form covv ɛ, X(S) = M n Σ = M n Σ Σ (Σ ) Σ, (6) where M n = λ b T (X(S) T X(S)) b + n ɛt I X(S)(X(S) T X(S)) X(S) T ɛ. (7) Lemma 3. Let M = λ b T (X(S) T X(S)) b and M = n ɛ T I X(S)(X(S) T X(S)) X(S) T ɛ, then M n = M + M. We have λ q P n C M λ q max n C exp{ 0.03n}, (8) min P M 3σ Cmax β n n. (9) Lemma 4. P max x i(s) C max max (6q, 4 log n) i=,...n n. (30) Proofs of these lemmas can be found in Appendix.7. Now, we prove Theorem 3. Analysis of M(V ): Define the event T = {M n v }, where v = λ q n C + 3σ Cmax β. min n By Lemma 3, we have P T exp{ 0.03n} + n. Let µ j = EV j ɛ, X(S), Z j = V j µ j, and Z = (Z j ) j S c, then EZ X(S), ɛ = 0 and cov(z X(S), ɛ) = cov(v X(S), ɛ) = M n Σ. From this inequality, we have max V j = max µ j S c j S c j + Z j max j S c µ j + Z j ( η)λ + max Z j. j S c {max j S c Z j < ηλ} {max j S c V j < λ}.

28 8 JINZHU JIA, KARL ROHE AND BIN YU Define Z to be a zero-mean Gaussian with covariance v Σ. Since P max Z j ηλ T c P Z j > ηλ T c j S c j S c we have P max j S c V j λ P This says that (p q) max P Z j > ηλ j S c (p q) exp{ η λ v C max }, max Z j λ T c + P T j S c (p q) exp{ η λ P M(V ) (p q) exp{ η λ v C max } + exp{ 0.03n} + n. v C max } exp{ 0.03n} n. Analysis of M(U): Now we analyze max j S U j. max U j j ( n X(S)T X(S)) n X(S)T ɛ + λ ( n X(S)T X(S)) b. Define Λ i ( ) to be the ith largest eigenvalue of a matrix. Since λ ( n X(S)T X(S)) b λ q Λ min ( n X(S)T X(S)), by Equation (37) in Corollary 4, we have P λ ( n X(S)T X(S)) b λ q exp( 0.03n). C min Let W i = e T i ( n X(S)T X(S)) n X(S)T ɛ, then conditioned on X(S), W i is a Gaussian random variable with mean 0, and variance var(w i X(S)) = e T i ( n X(S)T X(S)) n X(S)T V AR(ɛ) n X(S)( n X(S)T X(S)) e i σ β max i x i (S) nλ min ( n X(S)T X(S)).

29 THE LASSO UNDER HETEROSCEDASTICITY 9 Using (37) P Λ i ( n XT X) C min exp( 0.03n), and Lemma 4, we have σ β max i x i (S) nλ min ( n X(S)T X(S)) σ β C max max (6q, 4 log n) n C min with probability no less than exp{ 0.03n} n. Define event σ β max i x i (S) T = nλ min ( n X(S)T X(S)) σ β C max max (6q, 4 log n) n C min, then P (T ) exp{ 0.03n} n. From the proof of Lemma 5, for any t > 0, P ( W i > t X(S), T ) exp( var(w i X(S), T ) ). The above is also true if we replace var(w i X(S), T ) with any upper bound. So, we have So, P ( W i > t X(S), T ) exp P ( W i > t) P ( W i > t T ) + P (T c ) exp t σ β By takeing t = A(n, β, σ ) := P max W i > A(n, β, σ ) i S t t σ β C max max (6q,4 log n) n C min C max max (6q,4 log n) n C min 4σ β log n max(6q,4 log n) n C min. + exp{ 0.03n} + n., we have q n + q exp{ 0.03n} + q n = 3q + q exp{ 0.03n}. n

30 30 JINZHU JIA, KARL ROHE AND BIN YU Summarize, P max U i A(n, β, σ ) + λ q i C min 3q + q exp{ 0.03n} + exp{ 0.03n}. n At last, we have P M(V ) & M(U) (p q) exp{ η λ v C max } (q+3) exp{ 0.03n} + 3q n..5 Proofs of Corollary 3 Proof. By taking λ = M(β ) A(n,β,σ ) C min 4 q, we have Ψ(n, β, λ, σ ) = A(n, β, σ ) + λ q C min = M(β ) + A(n, β, σ ) < M(β ), where the last inequality uses the assumption that M(β ) > A(n, β, σ ). λ V (n, β, λ, σ ) = = = λ λ q n C + Cmax β 3σ min n q n C + Cmax β 3σ min nλ q 48σ n C + q Cmax β min nm(β ) A(n,β,σ ) C min. By the definition of Γ(n, β, σ ), we have that λ η V (n, β, λ, σ ) C max = log(p q + ) Γ(n, β, σ ),

31 THE LASSO UNDER HETEROSCEDASTICITY 3 so the probability bound in Theorem 3 now becomes, { λ η } exp V (n, β, λ, σ ) C + log(p q) (q + 3) exp{ cn} + 3q max n { } = exp log(p q + ) Γ(n, β, σ ) + log(p q) (q + 3) exp{ cn} + 3q n exp { } log(p q + ) Γ(n, β, σ ) (q + 3) exp{ cn} + 3q n If Condition (4) holds, then Γ(n, β, σ, α) which guarantees P ˆβ(λ) = s β..6 Proof of Theorem 4 Proof. Without loss of generality, assume e T j Σ (Σ ) sign(β (S)) = + ζ, for some j S c and ζ 0. Since EV X(S), ɛ = λσ (Σ ) sign(β (S)), V j, conditioned on X(S) and ɛ, is a Gaussian random variable with mean λ( + ζ). So P V j > λ( + ζ) X(S), ɛ =, which implies P V j > λ X(S), ɛ. Then we have P (V j > λ). So for any λ, P ˆβ(λ) = s β P max V j λ j..7 Proofs of Lemma Lemma 4 Proof of Lemma Proof. Conditioned on X(S) and ɛ, the only random component in V j is the column in the column vector X j, j S c. We know that (X(S c ) X(S), ɛ) (X(S c ) X(S)) is Gaussian with mean and covariance EX(S c ) T X(S), ɛ = Σ (Σ ) X(S) T, (3) var(x(s c ) X(S)) = Σ = Σ Σ (Σ ) Σ. (3)

32 3 JINZHU JIA, KARL ROHE AND BIN YU Consequently, we have, = EV X(S), ɛ { Σ (Σ ) X(S) T X(S)(X(S) T X(S)) λ b X(S)(X(S) T X(S)) X(S) T ɛ } I n = Σ (Σ ) λ b λ( η), where the last inequality uses Condition (3). Now, we compute the elements of the conditional covariance cov(v j, V k ɛ, X(S)). Let α = X(S)(X(S) T X(S)) λ b X(S)(X(S) T X(S)) X(S) T I) ɛ n, then V j = Xj T α. So we have cov(v j, V k ɛ, X(S)) = α T cov(x T j, X T k ɛ, X(S)) α = var(x(sc ) X(S)) jk α T α. Consequently, cov(v ɛ, X(S)) = α T α var(x(s c ) X(S)) = α T ασ = α T ασ Σ (Σ ) Σ. By careful calculation, we have α T α = M n. Proof of Lemma 3 Proof. Recall that M = λ b T (X(S) T X(S)) b. So, λq Λ max (X(S) T X(S)) M λq Λ min (X(S) T X(S)). From (37) we have, λq P n C M λq max n C exp( 0.03n). min Define ϱ = E Z, where Z N(0, ), then for any random variable R N(0, σ ), E R = σϱ. Since x T i β N(0, β (S) T Σ β (S)), we have E x T i β = β (S) T Σ β (S)ϱ.

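We know that $M_2 \le \frac{1}{n^2}\epsilon^T\epsilon$. Since

\[
E\epsilon_i^2 = E\,E\left[\epsilon_i^2 \mid X(S)\right] = E\,\sigma^2 |x_i^T\beta^*| = \sigma^2\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}\;\varrho,
\]

and

\[
E\epsilon_i^4 = E\,E\left[\epsilon_i^4 \mid X(S)\right] = 3E\,\sigma^4 |x_i^T\beta^*|^2 = 3\sigma^4\beta^*(S)^T\Sigma_{11}\beta^*(S),
\]

we have, by Chebyshev's inequality,

\[
\begin{aligned}
P\left[\frac{1}{n^2}\sum_i \epsilon_i^2 \ge \frac{\sigma^2\left(\varrho + \sqrt{3-\varrho^2}\right)\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}}{n}\right]
&= P\left[\sum_i \epsilon_i^2 - n\sigma^2\varrho\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)} \ge n\sigma^2\sqrt{3-\varrho^2}\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}\right] \\
&\le \frac{n\,\mathrm{var}(\epsilon_i^2)}{n^2\sigma^4(3-\varrho^2)\,\beta^*(S)^T\Sigma_{11}\beta^*(S)} \\
&= \frac{3\sigma^4\beta^*(S)^T\Sigma_{11}\beta^*(S) - \sigma^4\beta^*(S)^T\Sigma_{11}\beta^*(S)\,\varrho^2}{n\sigma^4(3-\varrho^2)\,\beta^*(S)^T\Sigma_{11}\beta^*(S)} = \frac{1}{n}.
\end{aligned}
\]

So,

\[
P\left[M_2 \ge \frac{\sigma^2\left(\varrho + \sqrt{3-\varrho^2}\right)\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}}{n}\right] \le \frac{1}{n}.
\]

While $\beta^*(S)^T\Sigma_{11}\beta^*(S) \le C_{\max}\|\beta^*\|_2^2$ and $\varrho = E(|Z|) \le \sqrt{E(Z^2)} = 1$, where $Z$ is a standard normal random variable, so

\[
\frac{\sigma^2\left(\varrho + \sqrt{3-\varrho^2}\right)\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}}{n} \le \frac{3\sigma^2\sqrt{C_{\max}}\,\|\beta^*\|_2}{n}.
\]

Then we have

\[
P\left[M_2 \ge \frac{3\sigma^2\sqrt{C_{\max}}\,\|\beta^*\|_2}{n}\right] \le \frac{1}{n}.
\]

Proof of Lemma 4

Proof. By Lemma 6, we have for any $t > q$,

\[
P\left[\max_{i=1,\ldots,n}\left\|\Sigma_{11}^{-1/2}x_i(S)\right\|_2^2 \ge 2t\right] \le n\exp\left(-t\left[1 - 2\sqrt{\frac{q}{t}}\right]\right).
\]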

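Take $t = \max(16q, 4\log n)$; we have

\[
\exp\left(-t\left[1 - 2\sqrt{\frac{q}{t}}\right]\right) \le \exp\left(-t\left[1 - 2\sqrt{\frac{1}{16}}\right]\right) = \exp\left(-\frac{t}{2}\right) \le \frac{1}{n^2},
\]

so,

\[
P\left[\max_{i=1,\ldots,n}\left\|\Sigma_{11}^{-1/2}x_i(S)\right\|_2^2 \ge 2\max(16q, 4\log n)\right] \le \frac{1}{n}.
\]

Since $\left\|\Sigma_{11}^{-1/2}x_i(S)\right\|_2^2 \ge \frac{1}{C_{\max}}\|x_i(S)\|_2^2$, we have

\[
P\left[\max_{i=1,\ldots,n}\|x_i(S)\|_2^2 \ge 2C_{\max}\max(16q, 4\log n)\right] \le \frac{1}{n}. \tag{33}
\]

2 Some Gaussian Comparison Results

Lemma 5. For any mean zero Gaussian random vector $(X_1, \ldots, X_n)$ and $t > 0$, we have

\[
P\left(\max_{1 \le i \le n} |X_i| \ge t\right) \le 2n\exp\left\{-\frac{t^2}{2\max_i E(X_i^2)}\right\}. \tag{34}
\]

Proof. Note that the moment generating function of $X_i$ is

\[
E\left(e^{tX_i}\right) = \exp\left\{\frac{E(X_i^2)\,t^2}{2}\right\}.
\]

So, for any $t > 0$ and $x > 0$,

\[
P(X_i \ge x) = P\left(e^{tX_i} \ge e^{tx}\right) \le \frac{E\left(e^{tX_i}\right)}{e^{tx}} = \exp\left\{\frac{E(X_i^2)\,t^2}{2} - xt\right\};
\]

by taking $t = x / E(X_i^2)$, we have

\[
P(X_i \ge x) \le \exp\left\{-\frac{x^2}{2E(X_i^2)}\right\}.
\]

So,

\[
P(|X_i| \ge t) = 2P(X_i \ge t) \le 2\exp\left\{-\frac{t^2}{2E(X_i^2)}\right\} \le 2\exp\left\{-\frac{t^2}{2\max_i E(X_i^2)}\right\}.
\]

So, by the union bound,

\[
P\left(\max_{1 \le i \le n} |X_i| \ge t\right) \le 2n\exp\left\{-\frac{t^2}{2\max_i E(X_i^2)}\right\}.
\]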
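The bound (34) requires no independence among the coordinates, which a short Monte Carlo check makes concrete. The sketch below is illustrative only; the function names `gaussian_max_bound` and `empirical_max_tail` and the equicorrelated covariance in the usage note are assumptions introduced here.

```python
import numpy as np

def gaussian_max_bound(second_moments, t):
    """Right-hand side of (34): 2 n exp{-t^2 / (2 max_i E X_i^2)}."""
    second_moments = np.asarray(second_moments, dtype=float)
    n = second_moments.size
    return 2 * n * np.exp(-t**2 / (2 * second_moments.max()))

def empirical_max_tail(cov, t, n_draws, rng):
    """Monte Carlo estimate of P(max_i |X_i| >= t) for X ~ N(0, cov)."""
    L = np.linalg.cholesky(cov)                     # cov = L L^T
    draws = rng.standard_normal((n_draws, cov.shape[0])) @ L.T
    return np.mean(np.abs(draws).max(axis=1) >= t)
```

For example, with `cov = 0.5 * np.eye(5) + 0.5 * np.ones((5, 5))` and `t = 2.5`, the bound evaluates to `10 * exp(-3.125)`, and the empirical tail probability from `empirical_max_tail` should fall below it.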

35 THE LASSO UNDER HETEROSCEDASTICITY 35 3 Large deviation for χ distribution Lemma 6. Let Z,..., Z n be i.i.d. χ -variates with q degrees of freedom. Then for all t > q, we have P max Z i > t i=,...,n n exp( t q t The proof of this lemma can be found in Obozinski et al. (008). ). (35) 4 Some useful random matrix results In this appendix, we use some known concentration inequalities for the extreme eigenvalues of Gaussian random matrices (Davidson and Szarek, 00) to bound the eigenvalues of a Gaussian random matrix. Although these results hold more generally, our interest here is on scalings (n, q) such that q/n 0. Lemma 7 (Davidson and Szarek (00)). Let Γ R n q be a random matrix whose entries are i.i.d. from N(0, /n), q n. Let the singular values of Γ be s (Γ)... s q (Γ). Then { q max P s (Γ) + n + t, P Using Lemma 7, we now have some useful results. } q s q (Γ) n t < exp{ nt /}. Lemma 8. Let U R n q be a random matrix with elements from the standard normal distribution (i.e., U ij N(0, ), i.i.d.) Assume that q/n 0. Let the eigenvalues of n U T U be Λ ( n U T U)... Λ q ( n U T U). Then when n is big enough, P Λ i( n U T U) exp( 0.03n). (36) Proof. Let Γ = n U, then Λ i ( n U T U) = s i (Γ). By Lemma 7, q P s q (Γ) n t < exp{ nt /}, by taking t = t 0 = P s q (Γ) 0., we have + 0. q n < exp{ nt 0/}.

36 36 JINZHU JIA, KARL ROHE AND BIN YU Since q/n 0 by assumption, we have when n is big enough, q/n < 0., then P s q (Γ) < < exp{ nt 0/}, which implies that, for any i =,..., q, P Λ i ( n (U T U)) < < exp{ nt 0/}. Followed the same procedures, P Λ i ( n (U T U)) > < exp{ nt /}, for t =.. Then inequality (36) holds immediately. Corollary 4. Let X R n q be a random matrix, of which, the rows are i.i.d. from the normal distribution with mean 0 and covariance Σ. Assume that 0 < C min Λ i (Σ) C max < and q/n 0, then when n is big enough, P C min Λ i ( n XT X) C max exp( 0.03n). (37) Proof. Let U = XΣ, then U satisfies the condition in Lemma 8. Then P Λ i( n U T U) exp( 0.03n). Since C min Λ ( n U T U) Λ i ( n XT X) C max Λ q ( n U T U), result (37) is obtained immediately. References Candes, E. and Tao, T Decoding by linear programming. IEEE Trans. Info Theory, 5(), Candes, E. and Tao, T The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics 35, 6, Chatterjee., S An error bound in the sudakov-fernique inequality. Technical report, UC Berkeley. arxiv:math.pr/05044.

The Lasso under Heteroscedasticity

The Lasso under Heteroscedasticity Jinzhu Jia Karl Rohe Department of Statistics University of California Berkeley, CA 9470, USA Bin Yu Department of Statistics, and Department of Electrical Engineering