arxiv: v1 [stat.ml] 3 Nov 2010

Size: px
Start display at page:

Download "arxiv: v1 [stat.ml] 3 Nov 2010"

Transcription

1 Preprint The Lasso under Heteroscedasticity Jinzhu Jia, Karl Rohe and Bin Yu, arxiv:0.06v stat.ml 3 Nov 00 Department of Statistics and Department of EECS University of California, Berkeley Abstract: The performance of the Lasso is well understood under the assumptions of the standard linear model with homoscedastic noise. However, in several applications, the standard model does not describe the important features of the data. This paper examines how the Lasso performs on a non-standard model that is motivated by medical imaging applications. In these applications, the variance of the noise scales linearly with the expectation of the observation. Like all heteroscedastic models, the noise terms in this Poisson-like model are not independent of the design matrix. More specifically, this paper studies the sign consistency of the Lasso under a sparse Poisson-like model. In addition to studying sufficient conditions for the sign consistency of the Lasso estimate, this paper also gives necessary conditions for sign consistency. Both sets of conditions are comparable to results for the homoscedastic model, showing that when a measure of the signal to noise ratio is large, the Lasso performs well on both Poisson-like data and homoscedastic data. Simulations reveal that the Lasso performs equally well in terms of model selection performance on both Poisson-like data and homoscedastic data (with properly scaled noise variance), across a range of parameterizations. Taken as a whole, these results suggest that the Lasso is robust to the Poisson-like heteroscedastic noise. Key words and phrases: Lasso, Poisson-like Model, Sign Consistency, Heteroscedasticity Introduction The Lasso (Tibshirani, 996) is widely used in high dimensional regression for variable selection. Its model selection performance has been well studied under a standard sparse and homoskedastic regression model. Several researchers have shown that under sparsity and regularity conditions, the Lasso can select the true model asymptotically even when p n (Donoho et al., 006; Meinshausen

2 JINZHU JIA, KARL ROHE AND BIN YU and Buhlmann, 006; Tropp, 006; Wainwright, 009; Zhao and Yu, 006). To define the Lasso estimate, suppose the observed data are independent pairs {(x i, Y i )} R p R for i =,,..., n following the linear regression model Y i = x T i β + ɛ i, () where x T i is a row vector representing the predictors for the ith observation, Y i is the corresponding ith response variable, ɛ i s are independent and mean zero noise terms, and β R p. Use X R n p to denote the n p design matrix with x T k = (X k,..., X kp ) as its kth row and with X j = (X j,..., X jn ) T as its jth column, then x T x X = T. x T n = (X, X,..., X p ). Let Y = (Y,..., Y n ) T and ɛ = (ɛ, ɛ,..., ɛ n ) T R n. The Lasso estimate (Tibshirani, 996) is then defined as the solution to a penalized least squares problem (with regularization parameter λ): ˆβ(λ) = arg min β n Y Xβ + λ β, () where for some vector x R k, x r = ( k i= x i r ) /r. In previous research on the Lasso, the above model has been assumed where the noise terms are i.i.d. and independent of the predictors (hence homoskedastic). We call this the standard model. Candes and Tao (007) suggested that compressed sensing, a sparse method similar to the Lasso could reduce the number of measurements needed by medical imaging technology like Magnetic Resonance Imaging (MRI). This methodology was later applied to MRI by Lustig et al. (008). The standard model was useful to their analyses. However, the standard model is not appropriate for other imaging methods such as PET and SPECT (Fessler, 000). PET provides an indirect measure for the metabolic activity of a specific tissue. To take an image, a biochemical metabolite must be identified that is attractive to the tissue under investigation. This biochemical metabolite is labeled with a positron emitting radioactive material and it is then injected into

3 THE LASSO UNDER HETEROSCEDASTICITY 3 the subject. This substance circulates through the subject, emitting positrons. When the tissue gathers the metabolite, the radioactive material concentrates around the tissue. The positron emissions can be modeled by a Poisson point process in three dimensions with an intensity rate proportional to the varying concentrations of the biochemical metabolite. Therefore, an estimate of the intensity rate is an estimate of the level of biochemcial metabolite. However, the positron emissions are not directly observed. After each positron is emitted it very quickly annihilates a nearby electron, sending two X-ray photons in nearly opposite directions (at the speed of light) Vardi et al. (985). These X-rays are observed by several sensors in a ring surrounding the subject. A physical model of this system informs the estimation of the intensity level of the Poisson process from the observed data. It can be expressed as a Poisson model where the sample size n represents the number of sensors; Y is a vector of observed values; βj represents the Poisson intensity rate for a small cubic volume (a voxel) inside the subject; the design matrix X specifies the physics of the tomography and emissions process; and p is the number of voxels wanted, the more voxels, the finer the resolution of the final image. Because the positron emissions are modeled by a Poisson point process, the variance of each observed value Y i is equal to the expected value E(Y i ). Motivated by the Poissonian model, this paper studies the Lasso under the following sparse Poisson-like model: Y = Xβ + ɛ, E(ɛ X) = 0, Cov(ɛ X) = σ diag( Xβ ), ɛ X(S c ) X(S), (3) where σ > 0 and the sparsity index set is defined as S = { j p : β j 0}, with the cardinality q = #S such that 0 < q < p. In the definition of the Poisson-like model, ɛ conditioned on X consists of independent Gaussian variables; Cov(ɛ X), the variance-covariance matrix of ɛ conditioned on X, is σ diag( Xβ ), an n n diagonal matrix with the vector

4 4 JINZHU JIA, KARL ROHE AND BIN YU σ Xβ down the diagonal; and X(S) and X(S c ) denote two matrices consisting of the relevant column vectors ( with nonzero coefficients) and irrelevant column vectors ( with zero coefficients) respectively. This is a heteroscedastic model. Since the Lasso provides a computationally feasible way to select a model (Efron et al., 004; Osborne et al., 000; Rosset, 004; Zhao and Yu, 007), it can be applied in the non-standard settings to give sparse solutions. In this paper we show that the Lasso is robust to the heteroscedastic noise of the sparse Poisson-like model. Under the Poisson-like model, for general scalings of p, q, n, and β, this paper investigates when the Lasso is sign consistent and when it is not with theoretical and simulation studies. Our results are comparable to the results for the standard model; when a measure of the signal to noise ratio is large, the Lasso is sign consistent. This is the first study of the sign consistency of the standard Lasso in a heteroscedastic setting.. Overview of Previous Work The Lasso (Tibshirani, 996) has been a popular technique to simultaneously select a model and provide regularized estimated coefficients. There is a substantial literature on the use of the Lasso for sparsity recovery and subset selection under the standard homoscedastic linear model. This subsection gives only a very brief overview. In noiseless setting (when ɛ = 0), with contributions from a broad range of researchers (Chen et al., 998; Donoho and Huo., 00; E. Candes and Tao., 004; Elad and Bruckstein., 00, 003; Tropp., 004), there is now much understanding of sufficient conditions on deterministic predictors {X i, i =,..., n} and sparsity index S = {j : βj 0} for which the true β can be recovered exactly. Results by Donoho (004), as well as Candes and Tao (005) provide high probability results for random ensembles of X. There is also a substantial body of work focusing on the noisy setting (where ɛ is random noise). Knight and Fu (000) analyze the asymptotic behavior of the optimal solution for fixed dimension (p); not only for L regularization, but for L r regularization with r (0,. Both Tropp (006) and Donoho et al. (006) provide sufficient conditions for the support of the optimal solution to the Lasso

5 THE LASSO UNDER HETEROSCEDASTICITY 5 problem () to be contained within the support of β. Recent work on the use of the Lasso for model selection by Meinshausen and Buhlmann (006), focuses on Gaussian graphical models. Zhao and Yu (006) considers linear regression and more general noise distributions. For the case of Gaussian noise and Gaussian predictors, both papers established that under particular mutual incoherence conditions and the appropriate choice of the regularization parameter λ, the Lasso can recover the sparsity pattern with probability converging to one for particular regimes of n, p and q. Zhao and Yu (006) used a particular mutual incoherence condition, the Irrepresentable Condition, which they show is almost necessary when p is fixed. The Irrepresentable Condition was found in Fuchs (005) and Zou (006) as well. For i.i.d. Gaussian or sub-gaussian noise, Wainwright (009) established a sharp relation between the problem dimension p, the number q of nonzero elements in β, and the number of observations n that are required for sign consistency.. The Contributions in this Paper Before giving the contributions of this paper, some definitions are needed. Define if x > 0 sign(x) = 0 if x = 0 if x < 0. Define = s such that ˆβ(λ) = s β if and only if sign( ˆβ(λ)) = sign(β ) elementwise. Definition. The Lasso is sign consistent if there exists a sequence λ n such that, ( P ˆβ(λn ) = s β ), as n. This paper studies the sign consistency of the Lasso applied to data from the sparse Poisson-like model, giving non-asymptotic results for both the deterministic design and the Gaussian random design. The non-asymptotic results give the probability that ˆβ(λ) = s β, for any λ, p, q, and n. The sign consistency results follow from the non-asymptotic results. This paper also gives necessary conditions for the Lasso to be sign consistent under the sparse Poisson-like model. It is shown that the Irrepresentable Condition is necessary for the Lasso s sign consistency under this model. This condition is also necessary under the standard

6 6 JINZHU JIA, KARL ROHE AND BIN YU model (Wainwright, 009; Zhao and Yu, 006; Zou, 006). The sufficient conditions for both the deterministic design and random Gaussian design require that the variance of the noise is not too large and that the smallest nonzero element of β is not too small. Define the smallest nonzero element of β as M(β ) = min j S β j. For deterministic design, assume that ( ) Λ min n X(S)T X(S) C min > 0, where Λ min ( ) denotes the minimal eigenvalue of a matrix and C min is some positive constant; for random Gaussian design, assume that Λ min (Σ ) C min > 0 and Λ max (Σ) C max <, where Σ R q q is the variance-covariance matrix of the true predictors, Σ R p p is the variance-covariance matrix of all predictors, Λ max ( ) denotes the maximal eigenvalue of a matrix, and C min and C max are some positive constants. An essential quantity for determining the probability of sign recovery is the signal to noise ratio SNR = nm(β ) σ β. (4) The numerator corresponds to the signal strength for sign recovery. The most difficult sign to estimate in β is the element that corresponds to M(β ). When the smallest element is larger, estimating the signs is easier, and the signal is more powerful. The noise term in the denominator contains β because in the heteroscedastic model considered in this paper, β is fundamental in the scaling of the noise. Because this paper is addressing the sign consistency of the Lasso, the definition of the signal in SNR does not correspond to the typical definition. When SN R is large, the Lasso is sign consistent. Specifically, the sufficient condition for deterministic design requires that ( ) SNR = Ω q log(p + ) max x i (S), i

7 THE LASSO UNDER HETEROSCEDASTICITY 7 where a n = Ω(b n ) means that a n grows faster than b n, that is a n /b n. The sufficient conditions for random Gaussian design requires that SNR 8 log n max(4q, log n)/ C min, and SNR = Ω (q log(p q + )). The previous three inequalities all require that the signal to noise ratio is large. This essential result for the Poisson-like model is identical to the corresponding result for the standard model both require that the variance of the noise is small compared to the size of the signal. The simulations in Section 4 support this essential result. Several of the mathematical techniques in this paper come from Wainwright (009). However, our proofs are more involved because the errors in the Poissonlike model are heteroscedastic and dependent on the design matrix. The results in this paper require bounding ɛ T ɛ n and X(S) T ɛ. When the errors are Gaussian and homoscedastic, the first quantity is distributed as χ. When the errors are heteroscedastic, the distribution becomes more complicated. This paper calculates the second moment of ɛ i and bounds ɛt ɛ/n via the Chebyshev inequality. Similarly for the second quantity, when the errors are dependent of the design matrix, the variance of X(S) T ɛ is more difficult to bound. In this paper, the variance of X(S) T ɛ is bounded using P max x i(s) 8 C max max (4q, log n) i=,...n n, where x i (S) is the ith row of X(S). This large deviation result regarding the χ distribution is given in Appendix 3. The remainder of the paper is organized as follows. Section analyzes the Lasso estimator when the design matrix is deterministic. the case where the rows of X are i.i.d. Gaussian vectors. Section 3 considers Both sections give () sufficient conditions for the Lasso to be sign consistent and () necessary conditions for the Lasso s sign consistency. In Section 4, simulations demonstrate

8 8 JINZHU JIA, KARL ROHE AND BIN YU the fundamental role of SNR and show that the Lasso performs similarly on both homoscedastic and Poisson-like data. Section 5 gives some concluding thoughts. Proofs are presented in Appendix. Deterministic Design This section examines when the Lasso is sign consistent and when it is not sign consistent under the sparse Poisson-like model for a nonrandom design matrix X. First, some notation, x i (S) = e T i X(S), where e i is the unit vector with ith element one and the rest zero. Because S = {j : β j 0} is the sparsity index set, x i(s) is a row vector of dimension q. Define β (S) = (β j ) j S Suppose the Irrepresentable Condition holds. (0,, and b = sign(β (S)). ( X(Sc ) T X(S) X(S) X(S)) T b That is, for some constant η η. (5) The l norm of a vector,, is defined as the vector s largest element in absolute value. In addition, assume that ( ) Λ min n X(S)T X(S) C min > 0, (6) where Λ min denotes the minimal eigenvalue and C min is some positive constant. Condition (6) guarantees that matrix X(S) T X(S) is invertible. These conditions are also needed in Wainwright (009) for sign consistency of the Lasso under the standard model. Define ( ) Ψ(X, β, λ) = λ η (C min ) / + n X(S)T X(S) b with which: Theorem. Suppose that data (X, Y ) follows the sparse Poisson-like model described by Equations (3) and each column of X is normalized to l -norm n.,

9 THE LASSO UNDER HETEROSCEDASTICITY 9 Assume that (5) and (6) hold. If λ satisfies M(β ) > Ψ(X, β, λ), then with probability greater than { nλ η } exp σ β + log(p), max i n x i (S) the Lasso has a unique solution ˆβ(λ) with ˆβ(λ) = s β. Theorem can be thought as a straightforward result from Theorem in Wainwright (009). In Wainwright (009), sign consistency of the Lasso estimate is given for a standard model with sub-gaussian noise with parameter σ. In the Poisson-like model, since var(ɛ i x i ) = σ x T i β σ max i x i (S) β, the noise can be thought of as sub-gaussian variables with parameter σ max i x i (S) β. To make this paper self-contained, a proof of Theorem is given in Appendix.. Theorem gives a non-asymptotic result on the Lasso s sparsity pattern recovery property. The next corollary specifies a sequence of λ s that can asymptotically recover the true sparsity pattern. The essential requirements are that () Define, nλ max i x i (S) β log(p + ) and () M(β ) > Ψ(X, β, λ). Γ(X, β, σ ) = η SNR 8 max i x i (S) (η C / min + q C min ) log(p + ). Corollary. As in Theorem, suppose that data (X, Y ) follows the sparse Poisson-like model described by Equations (3) and each column of X is normalized to l -norm n. Assume that (5) and (6) hold. Take λ such that then ˆβ(λ) = s β with probability greater than M(β ) λ = ( η C / min + ), (7) q Cmin exp { ( Γ(X, β, σ, α) ) log(p + ) }. If Γ(X, β, σ ), then P ˆβ(λ) = s β converges to one.

10 0 JINZHU JIA, KARL ROHE AND BIN YU A proof of this corollary can be found in Appendix.. This corollary gives a class of heteroscedastic models for which the Lasso gives a sign consistent estimate of β. This class requires that Γ(X, β, σ ) which means that SNR = nm(β ) σ β ( ) = Ω q log(p + ) max x i (S), (8) i where a n = Ω(b n ) means that a n grows faster than b n, that is, a n /b n. In other words, this condition requires that SN R grows fast enough. For a moment, suppose that the errors are homoscedastic and var(ɛ i ) = σ. The exact same theorem could be proven for this homoscedastic model by replacing σ β with σ and removing max i x i. This shows that when a measure of the signal to noise ratio is large, the Lasso can select the true model in both the case of homoscedastic noise and Poisson-like noise. The next corollary addresses the classical setting, where p, q, and β are all fixed and n goes to infinity. While this is a straightforward result from Corollary, it removes some of the complexities and leads to good intuition. Since M(β ) and β do not change with n, Γ(X, β, σ, α) in Corollary when n max i n x i (S) 0. Then: Corollary. As in Theorem, suppose that data (X, Y ) follows the sparse Poisson-like model described by Equations (3) and each column of X is normalized to l -norm n. Assume that (5) and (6) hold. In the classical case when p, q and β are fixed, if then by choosing λ as in equation (7), P ˆβ(λ) =s β as n. n max i n x i(s) 0, (9) Condition (9) is not strong and it is easy to be satisfied. Suppose ( ) 0 < Λ max n X(S)T X(S) C max <, where Λ max ( ) is the maximum eigenvalue of a matrix and C max is a positive

11 constant, then x i (S) n THE LASSO UNDER HETEROSCEDASTICITY = e T n i X(S) ( ) Λ max n X(S)T X(S) C max. Consequently, n max x i(s) i n = max n i n x i (S) n C / n max 0. Corollary states that in the classical settings, the Lasso can consistently select the true model under the sparse Poisson-like model. So far the results have given sufficient conditions for sign consistency of the Lasso. To further understand how the sign consistency of the Lasso might be sensitive to the heteroscedastic model, the next theorem gives necessary conditions on the ratio of β j to the noise level. Theorem (Necessary Conditions). Suppose that data (X, Y ) follows the sparse Poisson-like model described by Equations (3) and each column of X is normalized to l -norm n. Assume that (6) holds. (a) Consider n X(S)T X(S) = I q q. For any j, define c n,j = σ e T j n βj. (0) X(S) T diag( Xβ )X(S) e j Define c n = min j c n,j. Then, for sign consistency, it is necessary that c n. Specifically, P ˆβ(λ) =s β exp { c n/ } π( + cn ). (b) If the Irrepresentable Condition (5) does not hold, specifically, ( X(Sc ) T X(S) X(S) X(S)) T b, () then, the Lasso estimate is not sign consistent: P ˆβ(λ) =s β. A proof of Theorem can be found in Appendix.3.

12 JINZHU JIA, KARL ROHE AND BIN YU Statement (a) would hold for the homoscedastic model by removing diag( Xβ ) from the denominator in Equation (0). Equation (0) can be viewed as a comparison of the signal strength (βj ) to the noise level (var(xj T ɛ)). Theorem shows that the signal strength needs to be large relative to the noise level. Statement (b) says that the Irrepresentable Condition (5) is necessary for the Lasso s sign consistency. This necessary condition can also be found in both Zhao and Yu (006) and Wainwright (009). Zhao and Yu (006) points out that the Irrepresentable Condition is almost necessary and sufficient for the Lasso to be sign consistent under the standard homosedastic model when p and q are fixed. Wainwright (009) says that it is necessary for the Lasso s sign consistency under the standard model for any p and q. The results in this section have addressed the case when X is fixed. The essential results say that when SNR is large, the Lasso performs well at estimating sign(β ). In the next section X is random. The randomness of X allows us to study how the statistical dependence between X and the noise terms affects the sign consistency of the Lasso. When SNR is large, the Lasso is robust to this violation of independence. 3 Gaussian Random Design Consider the Gaussian random design where rows of X are i.i.d. from a p- dimensional multivariate Gaussian distribution with mean 0 and variance-covariance matrix Σ. Assume the diagonal entries of Σ are all equal to one. Define the variance-covariance matrix of the relevant predictors to be Σ and the covariance between the irrelevant predictors and the relevant predictors to be Σ. Specifically, ( ) Σ = E n X(S)T X(S) and ( ) Σ = E n X(Sc ) T X(S). Let Λ min ( ) denote the minimum eigenvalue of a matrix and Λ max ( ) denote the maximum eigenvalue of a matrix. To get the main results that allow p to grow with n, the following regularity conditions are needed on the p p matrix Σ.

13 THE LASSO UNDER HETEROSCEDASTICITY 3 First, for some positive constants C min and C max that do not depend on n, Λ min (Σ ) C min > 0 and Λ max (Σ) C max <, () and second, the Irrepresentable Condition, Σ (Σ ) b η, (3) for some constant η (0,. Assumptions () and (3) are standard assumptions in the previous research under standard models. Define, V (n, β, λ, σ ) = λ q n C + 3σ Cmax β, min n A(n, β, σ 4σ ) = β log n max(6q, 4 log n) n C and min Ψ(n, β, λ, σ ) = A(n, β, σ ) + λ q. C min These quantities defined above are used in the following theorem. Theorem 3. Consider the sparse Poisson-like model described by (3), under Gaussian random design. Suppose that the variance-covariance matrix Σ satisfies condition () and condition (3) with unit diagonal. Further, suppose that q/n 0. Then for any λ such that M(β ) > Ψ(n, β, λ, σ ), ˆβ(λ) = s β with probability greater than { λ η } exp V (n, β, λ, σ ) C + log(p q) (q+3) exp{ 0.03n} + 3q max n. A proof of Theorem 3 can be found in Appendix.4. Theorem 3 gives a non-asymptotic result on the Lasso s sparsity pattern recovery property, from which the next corollary can be derived. It specifies a sequence of λ s that asymptotically recovers the true sparsity pattern on a well behaved class of models. This class of models restricts the relationship between the data (X), the coefficients (β ), and the distribution of the noise (ɛ). Basically speaking, λ should be chosen such that () λ η V (n, β, λ, σ ) C max log(p q) and () M(β ) > Ψ(n, β, λ, σ ).

14 4 JINZHU JIA, KARL ROHE AND BIN YU Define Γ(n, β, σ ) = nη 4q log(p q + ) C max / C min + 96σ q β log(p q + ) C3 M(β ) A(n, β, σ ) C min Corollary 3. As in Theorem 3, consider the sparse Poisson-like model described by (3), under Gaussian random design. Suppose the variance-covariance matrix Σ satisfies condition () and condition (3) with unit diagnal. Further, suppose that M(β ) > A(n, β, σ ) and q/n 0. Take λ such that λ = M(β ) A(n, β, σ ) C min 4, q then ˆβ(λ) = s β with probability greater than { } exp log(p q + ) Γ(n, β, σ ) (q + 3) exp{ 0.03n} + 3q n. If n/q log(p q + ) and nm(β ) A(n, β, σ ) σ β q log(p q + ) then P ˆβ(λ) = s β converges to one. A proof of Corollary 3 can be found in Appendix.5. max, (4) This corollary gives a class of heteroscedastic models for which the Lasso gives a sign consistent estimate of β, when the predictors are from a Gaussian random ensemble. The sufficient conditions require that M(β ) A(n, β, σ ), which is equivalent to SNR 8 C min log n max(4q, log n). (5) The sufficient conditions also require the conditions in (4), which imply that SNR = Ω (q log(p q + )). (6) These conditions show that when SNR is large, the Lasso can identify the sign of the true predictors. This result is similar to the result for the fixed design case in the previous section and it is similar to results on the standard model..

15 THE LASSO UNDER HETEROSCEDASTICITY 5 The next theorem gives necessary conditions for the Lasso to be sign consistent. It says that the Irrepresentable Condition is necessary for the sign consistency of the Lasso under the sparse Poisson-like model. This condition is also necessary under the homoscedastic model. Theorem 4 (Necessary Conditions). Consider the sparse Poisson-like model described by (3), under Gaussian random design. Suppose the variance-covariance matrix Σ satisfies condition (). If the Irrepresentable Condition (3) does not hold, specifically, Σ (Σ ) b, (7) then, the Lasso estimate is not sign consistent: P ˆβ(λ) =s β ; A proof of Corollary 4 can be found in Appendix.6. This section identified sufficient conditions for the Lasso to be sign consistent when the design matrix is random and Gaussian. The sufficient conditions are similar for both fixed and random design matrices. They are also similar for both homoscedastic noise and Poisson-like noise. In all cases, the conditions require that a measure of the signal to noise ratio is large, see Equations (8) and (6) and the inequality in (5). In the next section, simulations are used to directly compare the performance of the Lasso between the Poisson-like model and the standard homoscedastic model. 4 Simulation Studies There are two examples in this section. The first example investigates a peculiarity of the SNR defined in (4); functions of β appear in both the signal and in the noise. The second example compares the model selection performance of the Lasso under the standard model to the model selection performance of the Lasso under the sparse Poisson-like model. In the first example, all data is generated from the sparse Poisson-like model. In the second example, the performance of the Lasso is compared between the two models of the noise. The parameterizations of the standard homoscedastic models differ in only one respect the noise terms are homoscedastic. To ensure a fair comparison, the variance of the noise terms in the standard model is set equal to the average variance of the noise terms in the corresponding Poisson-like model.

16 6 JINZHU JIA, KARL ROHE AND BIN YU All simulations were done in R with the LARS package (Efron et al., 004). Example This example studies how the Lasso is sensitive to the ratio of signal to noise defined earlier. Recall, SNR = nm(β ) σ β. (8) The theoretical results in the previous sections suggest that when SN R is large, the Lasso is sign consistent. In SNR, β can affect the ratio in two ways. As β grows, so does the variance of the noise term. As M(β ) shrinks, so does the signal. This first example investigates both effects, changing the value of β while keeping M(β ) fixed, and vice-versa. Consider an initial model with the parameters such that n = 400, p = 000, q = 0, σ =, and each element of the design matrix X is drawn independently from N(0, ). Once X is drawn, it will be fixed through all of the simulations. This X is also used in Example. β is designed this way: β max if j 0 βj = β min if j 0 0 otherwise. Changing β min or β changes the value of SNR. To investigate the role of these two quantities, there are two simulation designs. The first simulation design fixes M(β ) = β min = 5 and changes the value of β max. The second simulation design fixes β and changes the value of M(β ). There is one model that is present in both designs. It sets β max = 40 and β min = 5. In this model that is common to both designs, β = 7 and SNR = /7 78. The first simulation design has ten different parameterizations. It sets β min = 5 and chooses β max {00, 90, 80, 70, 60, 50, 40, 30, 0, 0}. Each of these ten different parameterizations creates a different value of SN R. The second simulation design has ten different parameterizations, each fixing β and altering β min such that SNR does not change from the first simulation design (to keep β fixed, β max must change accordingly). The values of the parameters for the two designs are described in the following two tables.

17 THE LASSO UNDER HETEROSCEDASTICITY 7 Table : The design of the first simulation is described in this table. It shows the relationship between β, M(β ), and SNR. M(β ) = 5 is fixed. When β max shrinks, β also shinks, increasing SNR. The numbers in the table are rounded to the nearest integer. β max β SN R Table : The design of the second simulation is described in this table. It shows the relationship between M(β ), β max, and SNR. β min and β max are chosen such that β = 7 is fixed and SNR keeps the same values as in simulation design one. Thus, β min and β max are decided by SNR and β. β min = SNR β /n and β max = β /0 β min. The numbers in the table are rounded. β min = M(β ) β max SN R For each simulation design, the Monte Carlo estimate for the probability of correctly estimating the signs is plotted against SN R in Figure. Each point along the solid line in Figure corresponds to simulation design one (β min is fixed), and each point along the dashed line corresponds to simulation design two ( β is fixed). Success is defined as the existence of a λ that makes ˆβ(λ) = s β. The probability of success for each point is estimated with 500 trials. Figure shows that as SNR increases, the probability of success also increases. What is especially remarkable is the similarity between the solid and dashed lines. This simulation demonstrates that increasing the elements of β can both increase and decrease the probability of successfully estimating the signs. Further, this simulation demonstrates that these effects are well characterized by SNR. Example (Comparison to Standard Model) This example compares the performance of the Lasso applied to homoscedastic data to the performance of the Lasso applied to Poisson-like data. The design matrix X is exactly the

18 8 JINZHU JIA, KARL ROHE AND BIN YU Probability of estimating correct signs Probability of success increases as SNR increases SNR Simulation design one Simulation design two Figure : Probability of Success vs. SNR in Example. For the solid line, β decreases while M(β ) is kept constant. For the dashed line, M(β ) increases while β is kept constant. The values of M(β ) and β are chosen so that SNR as defined in (8) takes the values specified on the horizontal axis. Each probability is estimated with 500 simulations. same (fixed) matrix as in Example and β follows the constructions specified in Tables () and (). The only difference between Example and Example is that the noise terms are homoscedastic in Example. To ensure a fair comparison, the variance of the noise is always set equal to the average variance of the noise terms in the corresponding Poisson-like model. There are four lines drawn in Figure. The solid lines correspond to simulation design one ( β grows while M(β ) is held constant). The dashed lines corresponds to simulation design two (M(β ) shrinks while β is held constant). The bold lines correspond to the simulations on homoscedastic data.

19 THE LASSO UNDER HETEROSCEDASTICITY 9 The thinner lines are identical to the lines in Figure. They are included to compare the performance of the Lasso on homoscedastic data to the performance of the Lasso on Poisson-like data. Exactly as in Example, success is defined as the existence of a λ which makes ˆβ(λ) = s β. The probability of success for each point is estimated with 500 trials. Probability of estimating correct signs Performance on homoscedastic data is similar to performance on Poisson like data Simulation design one, homoscedastic noise Simulation design one, Poisson like noise Simulation design two, homoscedastic noise Simulation design two, Poisson like noise SNR Figure : The bold solid line nearly covers the thinner solid line. This demonstrates how similar the results are for the homoscedastic data and the Poisson-like data. The same statement holds for the dashed lines. In the Poisson-like model, the variance of the noise term is dependent on the predicting variables. In the standard model, the noise term is homoscedastic, independent of the predicting variables. Yet, Figure demonstrates how similar the performance of the Lasso is under these two different models of the noise. In the figure, the thinner lines are nearly indistinguishable from the bold lines,

20 0 JINZHU JIA, KARL ROHE AND BIN YU showing that the Lasso is robust to one type of heteroscedastic noise. 5 Conclusion There is a significant amount of research dedicated to understanding how the Lasso (and similar methods) perform under the standard homoscedastic model. However, in practice, data does not necessarily have the features described by the standard model. This paper aims to understand if the sign consistency of the Lasso is robust to the heteroscedastic errors in the Poisson-like model. This model is motivated by certain problems in high-dimensional medical imaging (Fessler, 000). The key feature of the model is that, for each observation, the variance of the noise term is proportional to the absolute value of the expectation of the observation (var(ɛ i ) E(Y i ) ). In the Poisson-like model, like all heteroscedastic models, the noise term is not independent of the design matrix. In this paper, we analyzed the sign consistency of the standard Lasso when the data comes from the sparse Poisson-like model, showing that the Lasso is robust to one type of violation to the assumption of homoscedasticity. Theorems and 3 give non-asymptotic results for the Lasso s sign consistency property under the sparse Poisson-like model for both deterministic design matrices and random Gaussian ensemble design matrices. Followed by these non-asymptotic results, a suitable λ was chosen such that under sufficient conditions the Lasso is sign consistent. We also studied how sensitive the Lasso is to the heteroscedastic model by finding necessary conditions for sign consistency. The theoretical results for the sparse Poisson-like model are similar to results for the standard model when SNR are matched. In both models, for the Lasso to be sign consistent, it is essential that a measure of the signal to noise ratio is large. The simulations demonstrated what our theory predicted, the Lasso performs similarly in terms of model selection performance on both Poisson-like data and homoscedastic data when the variance of the noise is scaled appropriately. These simulations were across multiple choices of β and multiple choices of the noise level. Taken as a whole, these results suggest that the Lasso is robust to Possion-like heteroscedastic noise. Acknowledgment This work is inspired by a personal communication be-

21 THE LASSO UNDER HETEROSCEDASTICITY tween Bin Yu and Professor Peng Zhang from Capital Normal University in Beijing. We would like to thank Professor Ming Jiang and Vincent Vu for their helpful comments and suggestions on this paper. Jinzhu Jia and Bin Yu are partially supported by NSF grants DMS and SES Karl Rohe is partially supported by a NSF VIGRE Graduate Fellowship. Bin Yu is also partially supported by a grant from MSRA. Appendix Proofs. Proof of Theorem To prove the theorem, we need the next Lemma which gives necessary and sufficient conditions for the Lasso s sign consistency. They are important to the asymptotic analysis. Wainwright (009) gives this condition which follows from KKT conditions. Lemma. For linear model Y = Xβ + ɛ, assume that the matrix X(S) T X(S) is invertible. Then for any given λ > 0 and any noise term ɛ R n, there exists a Lasso estimate ˆβ(λ) which satisfies ˆβ(λ) = s β, if and only if the following two conditions hold X(Sc ) T X(S)(X(S) T X(S)) n X(S)T ɛ λsign(β (S)) n X(Sc ) T ɛ λ, (9) ( sign β (S) + ( ) n X(S)T X(S)) n X(S)T ɛ λsign(β (S)) = sign(β (S)), (0) where the vector inequality and equality are taken elementwise. Moreover, if (9) holds strictly, then ˆβ = ( ˆβ (), 0) is the unique optimal solution to the Lasso problem (), where ˆβ () = β (S) + ( n X(S)T X(S)) n X(S)T ɛ λsign(β ). ()

22 JINZHU JIA, KARL ROHE AND BIN YU Define As in Wainwright (009), we state sufficient conditions for (9) and (0). b = sign(β (S)), and denote by e i the vector with in the ith position and zeroes elsewhere. Define V j = X T j U i = e T i ( n X(S)T X(S)) n X(S)T ɛ λ b { X(S)(X(S) T X(S)) λ b X(S)(X(S) T X(S)) X(S) T ɛ I). n} By rearranging terms, it is easy to see that (9) holds strictly if and only if { } M(V ) = max V j < λ () j S c holds. If we define M(β ) = min j S βj (recall that S = {j : β j sparsity index), then the event M(U) =, 0} is the { } max U i < M(β ), (3) i S is sufficient to guarantee that condition (0) holds. Finally, a proof of Theorem. Proof. This proof is divided into two parts. First we analysis the asymptotic probability of event M(V ), and then we analysis the event of M(U). Analysis of M(V ) : Note from () that M(V ) holds if and only if max j S c V j λ <. Each random variable V j is Gaussian with mean µ j = λxj T X(S)(X(S) T X(S)) b. Define Ṽj = Xj T I X(S)(X(S) T X(S)) X(S) T ɛ n, then V j = µ j + Ṽj. Using condition (5), we have µ j ( η)λ for all j S c, from which we obtain that λ max Ṽj < η max j S c V j <. j S c λ By the Gaussian comparison result (34) stated in Lemma 5, we have P λ max Ṽj λ η η (p q) exp{ j S c max j S c E(Ṽ j )}.

23 Since THE LASSO UNDER HETEROSCEDASTICITY 3 E(Ṽ j ) = n XT j HV AR(ɛ)HX j, where H = I X(S)(X(S) T X(S)) X(S) T which has maximum eigenvalue equal to, and V AR(ɛ) is the variance-covariance matrix of ɛ, which is a diagonal matrix with the ith diagonal element equal to σ x T i β. yields Since x T i β x i (S) β max i x i (S) β, an operator bound E(Ṽ j ) σ n max x i (S) β X j = σ i n max x i (S) β. i Therefore, P λ max Ṽj η j So, we have P λ max V j < j Analysis of M(U) : { (p q) exp nλ η σ max i x i (S) β }. P λ max Ṽj η j { nλ η } (p q) exp σ β. max i x i (S) max U i ( i n X(S)T X(S)) n X(S)T ɛ + λ ( n X(S)T X(S)) b Define Z i := e T i ( n X(S)T X(S)) n X(S)T ɛ. Each Z i is a normal Gaussian with mean 0 and variance var(z i ) = e T i ( n X(S)T X(S)) n X(S)T V AR(ɛ) n X(S)( n X(S)T X(S)) e i σ β max i x i (S) nc min. So, for any t > 0, by (34) P (max Z t nc min i t) q exp{ i S σ β }, max i x i (S) by taking t = λη Cmin, we have P (max Z i λη { ) q exp i S Cmin nλ η σ β max i x i (S) }.

24 4 JINZHU JIA, KARL ROHE AND BIN YU Recall the definition of Ψ(X, β, λ) = λ η (C min ) / + ( n X(S)) X(S)T b we have { P (max U i Ψ(X, β nλ η }, λ)) q exp i σ β. max i x i (S) By condition M(β ) > Ψ(X, β, λ), we have { P (max U i < M(β nλ η } )) q exp i σ β. max i x i (S) At last, we have { nλ η } P M(V )& M(U) p exp σ β max i x i (S),. Proof of Corollary Proof. Recall the definition of Γ(X, β, σ ): Γ(X, β, σ ) = η SNR 8 max i x i (S) (η C / min + qc min ) log(p + ), where SNR = nm(β ) σ β. So, nη σ β = 4Γ(X, β, σ )(η C / min max i x i (S) By taking + q C min ) log(p + ) M(β ) M(β ) λ = ( η C / min + ), q Cmin we have ( ) Ψ(X, β, λ) = λ η (C min ) / + n X(S)T X(S) b λ η C / min + qcmin = M(β ) < M(β ),

25 THE LASSO UNDER HETEROSCEDASTICITY 5 and nλ η σ β max i x i (S) = Γ(X, β, σ ) log(p + ). So, the probability bound in Theorem greater than exp { ( Γ(X, β, σ ) ) log(p + ) }, which goes to one when Γ(X, β, σ, α)..3 Proof of Theorem Proof. First prove (b). Without loss of generality, assume for some j S c, Xj (X(S) X(S)) T X(S) T b = + ζ, then Vj = λ( + ζ) + Ṽj, where Ṽj = ( X(S) X(S) X(S)) T X(S) T I ɛ n is a Gaussian random variable with mean 0, so P (Ṽj > 0) =. So, P (V j > λ), which implies that for any λ, Condition (9) (a necessary condition) is violated with probability greater than /. For claim (a). Condition (0), ( sign β (S) + ( ) n X(S)T X(S)) n X(S)T ɛ λsign(β (S)) = sign(β (S)) is also a necessary condition for sign consistency. Since n X(S)T X(S) = I q q, (0) becomes which implies that ( ) sign β (S) + n X(S)T ɛ λsign(β (S)) = sign(β (S)), ( sign β (S) + ) n X(S)T ɛ = sign(β (S)). (4) Without loss of generality, assume for some j S, βj > 0. Then (4) implies βj + Z j > 0, where Z j = e T j n X(S)T ɛ is a Gaussian random variable with mean

26 6 JINZHU JIA, KARL ROHE AND BIN YU 0, and variance var(z j ) = e T j n X(S)T V AR(ɛ) n X(S)e j σ e T j X(S) T diag( Xβ )X(S) = = β j c, n,j n e j where the last equality uses the definition of c n,j in Theorem. To summarize, P ˆβ(λ) = s β P β j + Z j > 0 = P Z j > β j = P Z j < β j = = βj πvar(zj ) exp{ x var(z j ) }dx β j / var(z j ) π exp{ x }dx x π βj / ( var(z j ) + x + ( + x) ) exp{ x }dx { } exp β j var(z j ) = β π( + j ) var(zj ) { } exp c n,j = π( + cn,j )..4 Proofs of Theorem 3 To prove Theorem 3, we need some preliminary results. Lemma. Conditioned on X(S) and ɛ, the random vector V is Gaussian. Its mean vector is upper bound as EV ɛ, X(S) λ( η). (5)

27 THE LASSO UNDER HETEROSCEDASTICITY 7 Moreover, its conditional covariance takes the form covv ɛ, X(S) = M n Σ = M n Σ Σ (Σ ) Σ, (6) where M n = λ b T (X(S) T X(S)) b + n ɛt I X(S)(X(S) T X(S)) X(S) T ɛ. (7) Lemma 3. Let M = λ b T (X(S) T X(S)) b and M = n ɛ T I X(S)(X(S) T X(S)) X(S) T ɛ, then M n = M + M. We have λ q P n C M λ q max n C exp{ 0.03n}, (8) min P M 3σ Cmax β n n. (9) Lemma 4. P max x i(s) C max max (6q, 4 log n) i=,...n n. (30) Proofs of these lemmas can be found in Appendix.7. Now, we prove Theorem 3. Analysis of M(V ): Define the event T = {M n v }, where v = λ q n C + 3σ Cmax β. min n By Lemma 3, we have P T exp{ 0.03n} + n. Let µ j = EV j ɛ, X(S), Z j = V j µ j, and Z = (Z j ) j S c, then EZ X(S), ɛ = 0 and cov(z X(S), ɛ) = cov(v X(S), ɛ) = M n Σ. From this inequality, we have max V j = max µ j S c j S c j + Z j max j S c µ j + Z j ( η)λ + max Z j. j S c {max j S c Z j < ηλ} {max j S c V j < λ}.

28 8 JINZHU JIA, KARL ROHE AND BIN YU Define Z to be a zero-mean Gaussian with covariance v Σ. Since P max Z j ηλ T c P Z j > ηλ T c j S c j S c we have P max j S c V j λ P This says that (p q) max P Z j > ηλ j S c (p q) exp{ η λ v C max }, max Z j λ T c + P T j S c (p q) exp{ η λ P M(V ) (p q) exp{ η λ v C max } + exp{ 0.03n} + n. v C max } exp{ 0.03n} n. Analysis of M(U): Now we analyze max j S U j. max U j j ( n X(S)T X(S)) n X(S)T ɛ + λ ( n X(S)T X(S)) b. Define Λ i ( ) to be the ith largest eigenvalue of a matrix. Since λ ( n X(S)T X(S)) b λ q Λ min ( n X(S)T X(S)), by Equation (37) in Corollary 4, we have P λ ( n X(S)T X(S)) b λ q exp( 0.03n). C min Let W i = e T i ( n X(S)T X(S)) n X(S)T ɛ, then conditioned on X(S), W i is a Gaussian random variable with mean 0, and variance var(w i X(S)) = e T i ( n X(S)T X(S)) n X(S)T V AR(ɛ) n X(S)( n X(S)T X(S)) e i σ β max i x i (S) nλ min ( n X(S)T X(S)).

29 THE LASSO UNDER HETEROSCEDASTICITY 9 Using (37) P Λ i ( n XT X) C min exp( 0.03n), and Lemma 4, we have σ β max i x i (S) nλ min ( n X(S)T X(S)) σ β C max max (6q, 4 log n) n C min with probability no less than exp{ 0.03n} n. Define event σ β max i x i (S) T = nλ min ( n X(S)T X(S)) σ β C max max (6q, 4 log n) n C min, then P (T ) exp{ 0.03n} n. From the proof of Lemma 5, for any t > 0, P ( W i > t X(S), T ) exp( var(w i X(S), T ) ). The above is also true if we replace var(w i X(S), T ) with any upper bound. So, we have So, P ( W i > t X(S), T ) exp P ( W i > t) P ( W i > t T ) + P (T c ) exp t σ β By takeing t = A(n, β, σ ) := P max W i > A(n, β, σ ) i S t t σ β C max max (6q,4 log n) n C min C max max (6q,4 log n) n C min 4σ β log n max(6q,4 log n) n C min. + exp{ 0.03n} + n., we have q n + q exp{ 0.03n} + q n = 3q + q exp{ 0.03n}. n

30 30 JINZHU JIA, KARL ROHE AND BIN YU Summarize, P max U i A(n, β, σ ) + λ q i C min 3q + q exp{ 0.03n} + exp{ 0.03n}. n At last, we have P M(V ) & M(U) (p q) exp{ η λ v C max } (q+3) exp{ 0.03n} + 3q n..5 Proofs of Corollary 3 Proof. By taking λ = M(β ) A(n,β,σ ) C min 4 q, we have Ψ(n, β, λ, σ ) = A(n, β, σ ) + λ q C min = M(β ) + A(n, β, σ ) < M(β ), where the last inequality uses the assumption that M(β ) > A(n, β, σ ). λ V (n, β, λ, σ ) = = = λ λ q n C + Cmax β 3σ min n q n C + Cmax β 3σ min nλ q 48σ n C + q Cmax β min nm(β ) A(n,β,σ ) C min. By the definition of Γ(n, β, σ ), we have that λ η V (n, β, λ, σ ) C max = log(p q + ) Γ(n, β, σ ),

31 THE LASSO UNDER HETEROSCEDASTICITY 3 so the probability bound in Theorem 3 now becomes, { λ η } exp V (n, β, λ, σ ) C + log(p q) (q + 3) exp{ cn} + 3q max n { } = exp log(p q + ) Γ(n, β, σ ) + log(p q) (q + 3) exp{ cn} + 3q n exp { } log(p q + ) Γ(n, β, σ ) (q + 3) exp{ cn} + 3q n If Condition (4) holds, then Γ(n, β, σ, α) which guarantees P ˆβ(λ) = s β..6 Proof of Theorem 4 Proof. Without loss of generality, assume e T j Σ (Σ ) sign(β (S)) = + ζ, for some j S c and ζ 0. Since EV X(S), ɛ = λσ (Σ ) sign(β (S)), V j, conditioned on X(S) and ɛ, is a Gaussian random variable with mean λ( + ζ). So P V j > λ( + ζ) X(S), ɛ =, which implies P V j > λ X(S), ɛ. Then we have P (V j > λ). So for any λ, P ˆβ(λ) = s β P max V j λ j..7 Proofs of Lemma Lemma 4 Proof of Lemma Proof. Conditioned on X(S) and ɛ, the only random component in V j is the column in the column vector X j, j S c. We know that (X(S c ) X(S), ɛ) (X(S c ) X(S)) is Gaussian with mean and covariance EX(S c ) T X(S), ɛ = Σ (Σ ) X(S) T, (3) var(x(s c ) X(S)) = Σ = Σ Σ (Σ ) Σ. (3)

32 3 JINZHU JIA, KARL ROHE AND BIN YU Consequently, we have, = EV X(S), ɛ { Σ (Σ ) X(S) T X(S)(X(S) T X(S)) λ b X(S)(X(S) T X(S)) X(S) T ɛ } I n = Σ (Σ ) λ b λ( η), where the last inequality uses Condition (3). Now, we compute the elements of the conditional covariance cov(v j, V k ɛ, X(S)). Let α = X(S)(X(S) T X(S)) λ b X(S)(X(S) T X(S)) X(S) T I) ɛ n, then V j = Xj T α. So we have cov(v j, V k ɛ, X(S)) = α T cov(x T j, X T k ɛ, X(S)) α = var(x(sc ) X(S)) jk α T α. Consequently, cov(v ɛ, X(S)) = α T α var(x(s c ) X(S)) = α T ασ = α T ασ Σ (Σ ) Σ. By careful calculation, we have α T α = M n. Proof of Lemma 3 Proof. Recall that M = λ b T (X(S) T X(S)) b. So, λq Λ max (X(S) T X(S)) M λq Λ min (X(S) T X(S)). From (37) we have, λq P n C M λq max n C exp( 0.03n). min Define ϱ = E Z, where Z N(0, ), then for any random variable R N(0, σ ), E R = σϱ. Since x T i β N(0, β (S) T Σ β (S)), we have E x T i β = β (S) T Σ β (S)ϱ.

33 THE LASSO UNDER HETEROSCEDASTICITY 33 We know that M n ɛ T ɛ. Since Eɛ i = EEɛ i X(S) = Eσ x T i β = σ β (S) T Σ β (S)ϱ, and Eɛ 4 i = EEɛ4 i X(S) = 3Eσ4 x T i β = 3σ 4 β (S) T Σ β (S), we have P i ɛ i n σ (ϱ + 3 ϱ ) β (S) T Σ β (S) n = P ɛ i nσ ϱ β (S) T Σ β (S) nσ 3 ϱ β (S) T Σ β (S) So, While i nvar(ɛ i ) n σ 4 (3 ϱ )β (S) T Σ β (S) = 3σ4 β (S) T Σ β (S) σ 4 β (S) T Σ β (S)ϱ nσ 4 (3 ϱ )β (S) T Σ β (S) = n P M σ (ϱ + 3 ϱ ) β (S) T Σ β (S) n n. β T Σ β (S) Cmax β and ϱ = E( Z ) E( Z ) =, where Z is a standard normal random variable, so Then we have σ (ϱ + 3 ϱ ) β (S) T Σ β (S) n P M 3σ Cmax β n n. 3σ Cmax β. n Proof of Lemma 4 Proof. By lemma 6, we have for any t > q, P max i=,...n Σ x i(s) t n exp( t q t ).

34 34 JINZHU JIA, KARL ROHE AND BIN YU Take t = max (6q, 4 log n), we have exp( t q t so, P max i=,...n Σ 6 ) = exp( t ) ) exp( t n, x i(s) max (6q, 4 log n) n. Since Σ x i(s) C x i (S) max, we have P max x i(s) C max max (6q, 4 log n) i=,...n n. (33) Some Gaussian Comparison Results Lemma 5. For any mean zero Gaussian random vector (X,..., X n ), and t > 0, we have P ( max X i t) n exp{ i n max i E(Xi (34) )} Proof. Note that the generate function of X i is t So, for any t > 0, E(e tx i ) = exp{ E(X i )t }. P (X i x) = P (e tx i e tx ) E(etX i ) e tx = exp{ E(X i )t by taking t = x E(Xi ), we have xt}, x P (X i x) exp{ E(Xi )}. So, P ( X i t) = P (X i t) exp{ t E(Xi )} exp{ t max i E(Xi )}. So, P ( max X i t) n exp{ i n max i E(Xi )}. t

35 THE LASSO UNDER HETEROSCEDASTICITY 35 3 Large deviation for χ distribution Lemma 6. Let Z,..., Z n be i.i.d. χ -variates with q degrees of freedom. Then for all t > q, we have P max Z i > t i=,...,n n exp( t q t The proof of this lemma can be found in Obozinski et al. (008). ). (35) 4 Some useful random matrix results In this appendix, we use some known concentration inequalities for the extreme eigenvalues of Gaussian random matrices (Davidson and Szarek, 00) to bound the eigenvalues of a Gaussian random matrix. Although these results hold more generally, our interest here is on scalings (n, q) such that q/n 0. Lemma 7 (Davidson and Szarek (00)). Let Γ R n q be a random matrix whose entries are i.i.d. from N(0, /n), q n. Let the singular values of Γ be s (Γ)... s q (Γ). Then { q max P s (Γ) + n + t, P Using Lemma 7, we now have some useful results. } q s q (Γ) n t < exp{ nt /}. Lemma 8. Let U R n q be a random matrix with elements from the standard normal distribution (i.e., U ij N(0, ), i.i.d.) Assume that q/n 0. Let the eigenvalues of n U T U be Λ ( n U T U)... Λ q ( n U T U). Then when n is big enough, P Λ i( n U T U) exp( 0.03n). (36) Proof. Let Γ = n U, then Λ i ( n U T U) = s i (Γ). By Lemma 7, q P s q (Γ) n t < exp{ nt /}, by taking t = t 0 = P s q (Γ) 0., we have + 0. q n < exp{ nt 0/}.

36 36 JINZHU JIA, KARL ROHE AND BIN YU Since q/n 0 by assumption, we have when n is big enough, q/n < 0., then P s q (Γ) < < exp{ nt 0/}, which implies that, for any i =,..., q, P Λ i ( n (U T U)) < < exp{ nt 0/}. Followed the same procedures, P Λ i ( n (U T U)) > < exp{ nt /}, for t =.. Then inequality (36) holds immediately. Corollary 4. Let X R n q be a random matrix, of which, the rows are i.i.d. from the normal distribution with mean 0 and covariance Σ. Assume that 0 < C min Λ i (Σ) C max < and q/n 0, then when n is big enough, P C min Λ i ( n XT X) C max exp( 0.03n). (37) Proof. Let U = XΣ, then U satisfies the condition in Lemma 8. Then P Λ i( n U T U) exp( 0.03n). Since C min Λ ( n U T U) Λ i ( n XT X) C max Λ q ( n U T U), result (37) is obtained immediately. References Candes, E. and Tao, T Decoding by linear programming. IEEE Trans. Info Theory, 5(), Candes, E. and Tao, T The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics 35, 6, Chatterjee., S An error bound in the sudakov-fernique inequality. Technical report, UC Berkeley. arxiv:math.pr/05044.

The Lasso under Heteroscedasticity

The Lasso under Heteroscedasticity The Lasso under Heteroscedasticity Jinzhu Jia Karl Rohe Department of Statistics University of California Berkeley, CA 9470, USA Bin Yu Department of Statistics, and Department of Electrical Engineering

More information

On Model Selection Consistency of Lasso

On Model Selection Consistency of Lasso On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA

LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA The Annals of Statistics 2009, Vol. 37, No. 1, 246 270 DOI: 10.1214/07-AOS582 Institute of Mathematical Statistics, 2009 LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA BY NICOLAI

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

arxiv: v2 [math.st] 12 Feb 2008

arxiv: v2 [math.st] 12 Feb 2008 arxiv:080.460v2 [math.st] 2 Feb 2008 Electronic Journal of Statistics Vol. 2 2008 90 02 ISSN: 935-7524 DOI: 0.24/08-EJS77 Sup-norm convergence rate and sign concentration property of Lasso and Dantzig

More information

High-dimensional statistics: Some progress and challenges ahead

High-dimensional statistics: Some progress and challenges ahead High-dimensional statistics: Some progress and challenges ahead Martin Wainwright UC Berkeley Departments of Statistics, and EECS University College, London Master Class: Lecture Joint work with: Alekh

More information

1 Regression with High Dimensional Data

1 Regression with High Dimensional Data 6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

The Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression

The Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression The Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression Cun-hui Zhang and Jian Huang Presenter: Quefeng Li Feb. 26, 2010 un-hui Zhang and Jian Huang Presenter: Quefeng The Sparsity

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Lasso-type recovery of sparse representations for high-dimensional data

Lasso-type recovery of sparse representations for high-dimensional data Lasso-type recovery of sparse representations for high-dimensional data Nicolai Meinshausen and Bin Yu Department of Statistics, UC Berkeley December 5, 2006 Abstract The Lasso (Tibshirani, 1996) is an

More information

Sparsity and the Lasso

Sparsity and the Lasso Sparsity and the Lasso Statistical Machine Learning, Spring 205 Ryan Tibshirani (with Larry Wasserman Regularization and the lasso. A bit of background If l 2 was the norm of the 20th century, then l is

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Learning discrete graphical models via generalized inverse covariance matrices

Learning discrete graphical models via generalized inverse covariance matrices Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,

More information

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming High Dimensional Inverse Covariate Matrix Estimation via Linear Programming Ming Yuan October 24, 2011 Gaussian Graphical Model X = (X 1,..., X p ) indep. N(µ, Σ) Inverse covariance matrix Σ 1 = Ω = (ω

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

High-dimensional Statistical Models

High-dimensional Statistical Models High-dimensional Statistical Models Pradeep Ravikumar UT Austin MLSS 2014 1 Curse of Dimensionality Statistical Learning: Given n observations from p(x; θ ), where θ R p, recover signal/parameter θ. For

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

A Comparative Framework for Preconditioned Lasso Algorithms

A Comparative Framework for Preconditioned Lasso Algorithms A Comparative Framework for Preconditioned Lasso Algorithms Fabian L. Wauthier Statistics and WTCHG University of Oxford flw@stats.ox.ac.uk Nebojsa Jojic Microsoft Research, Redmond jojic@microsoft.com

More information

High-dimensional graphical model selection: Practical and information-theoretic limits

High-dimensional graphical model selection: Practical and information-theoretic limits 1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John

More information

arxiv: v1 [math.st] 5 Oct 2009

arxiv: v1 [math.st] 5 Oct 2009 On the conditions used to prove oracle results for the Lasso Sara van de Geer & Peter Bühlmann ETH Zürich September, 2009 Abstract arxiv:0910.0722v1 [math.st] 5 Oct 2009 Oracle inequalities and variable

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees

Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Prof. Emmanuel Candes and Prof. Wotao Yin

More information

arxiv: v1 [math.st] 20 Nov 2009

arxiv: v1 [math.st] 20 Nov 2009 REVISITING MARGINAL REGRESSION arxiv:0911.4080v1 [math.st] 20 Nov 2009 By Christopher R. Genovese Jiashun Jin and Larry Wasserman Carnegie Mellon University May 31, 2018 The lasso has become an important

More information

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered,

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, L penalized LAD estimator for high dimensional linear regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, where the overall number of variables

More information

High-dimensional graphical model selection: Practical and information-theoretic limits

High-dimensional graphical model selection: Practical and information-theoretic limits 1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

A new method on deterministic construction of the measurement matrix in compressed sensing

A new method on deterministic construction of the measurement matrix in compressed sensing A new method on deterministic construction of the measurement matrix in compressed sensing Qun Mo 1 arxiv:1503.01250v1 [cs.it] 4 Mar 2015 Abstract Construction on the measurement matrix A is a central

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Minimax Rates of Estimation for High-Dimensional Linear Regression Over -Balls

Minimax Rates of Estimation for High-Dimensional Linear Regression Over -Balls 6976 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Minimax Rates of Estimation for High-Dimensional Linear Regression Over -Balls Garvesh Raskutti, Martin J. Wainwright, Senior

More information

Compressed Sensing and Sparse Recovery

Compressed Sensing and Sparse Recovery ELE 538B: Sparsity, Structure and Inference Compressed Sensing and Sparse Recovery Yuxin Chen Princeton University, Spring 217 Outline Restricted isometry property (RIP) A RIPless theory Compressed sensing

More information

Simultaneous Support Recovery in High Dimensions: Benefits and Perils of Block `1=` -Regularization

Simultaneous Support Recovery in High Dimensions: Benefits and Perils of Block `1=` -Regularization IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 6, JUNE 2011 3841 Simultaneous Support Recovery in High Dimensions: Benefits and Perils of Block `1=` -Regularization Sahand N. Negahban and Martin

More information

Sliced Inverse Regression

Sliced Inverse Regression Sliced Inverse Regression Ge Zhao gzz13@psu.edu Department of Statistics The Pennsylvania State University Outline Background of Sliced Inverse Regression (SIR) Dimension Reduction Definition of SIR Inversed

More information

Sparse signal recovery using sparse random projections

Sparse signal recovery using sparse random projections Sparse signal recovery using sparse random projections Wei Wang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2009-69 http://www.eecs.berkeley.edu/pubs/techrpts/2009/eecs-2009-69.html

More information

Guaranteed Sparse Recovery under Linear Transformation

Guaranteed Sparse Recovery under Linear Transformation Ji Liu JI-LIU@CS.WISC.EDU Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706, USA Lei Yuan LEI.YUAN@ASU.EDU Jieping Ye JIEPING.YE@ASU.EDU Department of Computer Science

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 6, JUNE

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 6, JUNE IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 6, JUNE 2010 2967 Information-Theoretic Limits on Sparse Signal Recovery: Dense versus Sparse Measurement Matrices Wei Wang, Member, IEEE, Martin J.

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing

Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

Sparsity in Underdetermined Systems

Sparsity in Underdetermined Systems Sparsity in Underdetermined Systems Department of Statistics Stanford University August 19, 2005 Classical Linear Regression Problem X n y p n 1 > Given predictors and response, y Xβ ε = + ε N( 0, σ 2

More information

Estimating Unknown Sparsity in Compressed Sensing

Estimating Unknown Sparsity in Compressed Sensing Estimating Unknown Sparsity in Compressed Sensing Miles Lopes UC Berkeley Department of Statistics CSGF Program Review July 16, 2014 early version published at ICML 2013 Miles Lopes ( UC Berkeley ) estimating

More information

arxiv: v1 [math.st] 8 Jan 2008

arxiv: v1 [math.st] 8 Jan 2008 arxiv:0801.1158v1 [math.st] 8 Jan 2008 Hierarchical selection of variables in sparse high-dimensional regression P. J. Bickel Department of Statistics University of California at Berkeley Y. Ritov Department

More information

High-dimensional Statistics

High-dimensional Statistics High-dimensional Statistics Pradeep Ravikumar UT Austin Outline 1. High Dimensional Data : Large p, small n 2. Sparsity 3. Group Sparsity 4. Low Rank 1 Curse of Dimensionality Statistical Learning: Given

More information

Saharon Rosset 1 and Ji Zhu 2

Saharon Rosset 1 and Ji Zhu 2 Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.

More information

Discussion of High-dimensional autocovariance matrices and optimal linear prediction,

Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Electronic Journal of Statistics Vol. 9 (2015) 1 10 ISSN: 1935-7524 DOI: 10.1214/15-EJS1007 Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Xiaohui Chen University

More information

The Iterated Lasso for High-Dimensional Logistic Regression

The Iterated Lasso for High-Dimensional Logistic Regression The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of

More information

Fast Angular Synchronization for Phase Retrieval via Incomplete Information

Fast Angular Synchronization for Phase Retrieval via Incomplete Information Fast Angular Synchronization for Phase Retrieval via Incomplete Information Aditya Viswanathan a and Mark Iwen b a Department of Mathematics, Michigan State University; b Department of Mathematics & Department

More information

On Sparsity, Redundancy and Quality of Frame Representations

On Sparsity, Redundancy and Quality of Frame Representations On Sparsity, Redundancy and Quality of Frame Representations Mehmet Açaaya Division of Engineering and Applied Sciences Harvard University Cambridge, MA Email: acaaya@fasharvardedu Vahid Taroh Division

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information

arxiv: v1 [math.st] 13 Feb 2012

arxiv: v1 [math.st] 13 Feb 2012 Sparse Matrix Inversion with Scaled Lasso Tingni Sun and Cun-Hui Zhang Rutgers University arxiv:1202.2723v1 [math.st] 13 Feb 2012 Address: Department of Statistics and Biostatistics, Hill Center, Busch

More information

Sparse representation classification and positive L1 minimization

Sparse representation classification and positive L1 minimization Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

A Significance Test for the Lasso

A Significance Test for the Lasso A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen May 14, 2013 1 Last time Problem: Many clinical covariates which are important to a certain medical

More information

Generalized Concomitant Multi-Task Lasso for sparse multimodal regression

Generalized Concomitant Multi-Task Lasso for sparse multimodal regression Generalized Concomitant Multi-Task Lasso for sparse multimodal regression Mathurin Massias https://mathurinm.github.io INRIA Saclay Joint work with: Olivier Fercoq (Télécom ParisTech) Alexandre Gramfort

More information

Theoretical results for lasso, MCP, and SCAD

Theoretical results for lasso, MCP, and SCAD Theoretical results for lasso, MCP, and SCAD Patrick Breheny March 2 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/23 Introduction There is an enormous body of literature concerning theoretical

More information

Equivalence Probability and Sparsity of Two Sparse Solutions in Sparse Representation

Equivalence Probability and Sparsity of Two Sparse Solutions in Sparse Representation IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 12, DECEMBER 2008 2009 Equivalence Probability and Sparsity of Two Sparse Solutions in Sparse Representation Yuanqing Li, Member, IEEE, Andrzej Cichocki,

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Motivation Sparse Signal Recovery is an interesting area with many potential applications. Methods developed for solving sparse signal recovery proble

Motivation Sparse Signal Recovery is an interesting area with many potential applications. Methods developed for solving sparse signal recovery proble Bayesian Methods for Sparse Signal Recovery Bhaskar D Rao 1 University of California, San Diego 1 Thanks to David Wipf, Zhilin Zhang and Ritwik Giri Motivation Sparse Signal Recovery is an interesting

More information

Stable Signal Recovery from Incomplete and Inaccurate Measurements

Stable Signal Recovery from Incomplete and Inaccurate Measurements Stable Signal Recovery from Incomplete and Inaccurate Measurements EMMANUEL J. CANDÈS California Institute of Technology JUSTIN K. ROMBERG California Institute of Technology AND TERENCE TAO University

More information

11 : Gaussian Graphic Models and Ising Models

11 : Gaussian Graphic Models and Ising Models 10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood

More information

Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery

Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery : Distribution-Oblivious Support Recovery Yudong Chen Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 7872 Constantine Caramanis Department of Electrical

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical

More information

Greedy and Relaxed Approximations to Model Selection: A simulation study

Greedy and Relaxed Approximations to Model Selection: A simulation study Greedy and Relaxed Approximations to Model Selection: A simulation study Guilherme V. Rocha and Bin Yu April 6, 2008 Abstract The Minimum Description Length (MDL) principle is an important tool for retrieving

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso)

The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso) Electronic Journal of Statistics Vol. 0 (2010) ISSN: 1935-7524 The adaptive the thresholded Lasso for potentially misspecified models ( a lower bound for the Lasso) Sara van de Geer Peter Bühlmann Seminar

More information

Shifting Inequality and Recovery of Sparse Signals

Shifting Inequality and Recovery of Sparse Signals Shifting Inequality and Recovery of Sparse Signals T. Tony Cai Lie Wang and Guangwu Xu Astract In this paper we present a concise and coherent analysis of the constrained l 1 minimization method for stale

More information

Bayesian Methods for Sparse Signal Recovery

Bayesian Methods for Sparse Signal Recovery Bayesian Methods for Sparse Signal Recovery Bhaskar D Rao 1 University of California, San Diego 1 Thanks to David Wipf, Jason Palmer, Zhilin Zhang and Ritwik Giri Motivation Motivation Sparse Signal Recovery

More information

An Introduction to Sparse Approximation

An Introduction to Sparse Approximation An Introduction to Sparse Approximation Anna C. Gilbert Department of Mathematics University of Michigan Basic image/signal/data compression: transform coding Approximate signals sparsely Compress images,

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information

ROP: MATRIX RECOVERY VIA RANK-ONE PROJECTIONS 1. BY T. TONY CAI AND ANRU ZHANG University of Pennsylvania

ROP: MATRIX RECOVERY VIA RANK-ONE PROJECTIONS 1. BY T. TONY CAI AND ANRU ZHANG University of Pennsylvania The Annals of Statistics 2015, Vol. 43, No. 1, 102 138 DOI: 10.1214/14-AOS1267 Institute of Mathematical Statistics, 2015 ROP: MATRIX RECOVERY VIA RANK-ONE PROJECTIONS 1 BY T. TONY CAI AND ANRU ZHANG University

More information

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology

More information

Composite Loss Functions and Multivariate Regression; Sparse PCA

Composite Loss Functions and Multivariate Regression; Sparse PCA Composite Loss Functions and Multivariate Regression; Sparse PCA G. Obozinski, B. Taskar, and M. I. Jordan (2009). Joint covariate selection and joint subspace selection for multiple classification problems.

More information

Within Group Variable Selection through the Exclusive Lasso

Within Group Variable Selection through the Exclusive Lasso Within Group Variable Selection through the Exclusive Lasso arxiv:1505.07517v1 [stat.me] 28 May 2015 Frederick Campbell Department of Statistics, Rice University and Genevera Allen Department of Statistics,

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information