A First Order Free Lunch for SQRT-Lasso


A First Order Free Lunch for SQRT-Lasso

Xingguo Li, Jarvis Haupt, Raman Arora, Han Liu, Mingyi Hong, and Tuo Zhao

arXiv [cs.LG], May 2016

Abstract

Many statistical machine learning techniques sacrifice convenient computational structures to gain estimation robustness and modeling flexibility. In this paper, we study this fundamental tradeoff through a SQRT-Lasso problem for sparse linear regression and sparse precision matrix estimation in high dimensions. We explain how novel optimization techniques help address these computational challenges. Particularly, we propose a pathwise iterative smoothing shrinkage thresholding algorithm for solving the SQRT-Lasso optimization problem. We further provide a novel model-based perspective for analyzing the smoothing optimization framework, which allows us to establish a nearly linear convergence (R-linear convergence) guarantee for our proposed algorithm. This implies that solving the SQRT-Lasso optimization is almost as easy as solving the Lasso optimization. Moreover, we show that our proposed algorithm can also be applied to sparse precision matrix estimation, and enjoys good computational properties. Numerical experiments are provided to support our theory.

1 Introduction

Given a design matrix X ∈ R^{n×d} and a response vector y ∈ R^n, we consider a linear model y = Xθ* + ε, where θ* ∈ R^d is an unknown coefficient vector, and ε ∈ R^n is a random noise vector with i.i.d. sub-Gaussian entries, E[ε_i] = 0 and E[ε_i^2] = σ^2 for all i = 1, ..., n. We are interested in estimating θ* in high dimensions, where n/d → 0. A popular assumption in high dimensions is that only a small subset of variables is relevant in modeling, i.e., many entries of θ* are zero. To obtain such a sparse estimator, Tibshirani (1996) proposed Lasso, which solves

  θ̂ = argmin_θ (1/n)||y − Xθ||_2^2 + λ_Lasso ||θ||_1,   (1.1)

where λ_Lasso is the regularization parameter, which encourages solution sparsity. The statistical properties of Lasso have been established in Zhang and Huang (2008); Zhang (2009); Bickel et al.

(Affiliations: Xingguo Li and Jarvis Haupt are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA; Raman Arora and Tuo Zhao are with the Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Han Liu is with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; Mingyi Hong is with the Department of Industrial and Manufacturing Systems Engineering, Iowa State University. Emails: lixx66@umn.edu, jdhaupt@umn.edu, arora@cs.jhu.edu, hanliu@princeton.edu, mingyi@iastate.edu, tour@cs.jhu.edu.)

(2009); Negahban et al. (2012). In particular, given λ_Lasso ≍ σ√(log d / n), the Lasso estimator in (1.1) attains the minimax optimal rate of convergence in parameter estimation,

  ||θ̂ − θ*||_2 = O_P(σ√(s* log d / n)),   (1.2)

where s* denotes the number of nonzero entries in θ* (Ye and Zhang, 2010; Raskutti et al., 2011). Despite these favorable properties, the Lasso approach has a significant drawback: the selected regularization parameter λ_Lasso scales linearly with the unknown quantity σ. Therefore, we need to carefully tune λ_Lasso over a wide range of potential values in order to get a good finite-sample performance. To overcome this drawback, Belloni et al. (2011) proposed SQRT-Lasso, which solves

  θ̂ = argmin_{θ ∈ R^d} (1/√n)||y − Xθ||_2 + λ_SQRT ||θ||_1.   (1.3)

They further show that SQRT-Lasso requires no prior knowledge of σ. Choosing λ_SQRT ≍ √(log d / n), the SQRT-Lasso estimator in (1.3) attains the same optimal statistical rate of convergence in parameter estimation as (1.2). This means that the regularization selection for SQRT-Lasso does not scale with σ, and we can specify a much smaller range of potential values for tuning λ_SQRT than for Lasso. Besides estimating θ*, SQRT-Lasso can also estimate σ, which further makes it applicable to sparse precision matrix estimation; this is not the case with Lasso. Specifically, given n observations i.i.d. sampled from a d-variate normal distribution with mean 0 and a sparse precision matrix Θ*, Liu and Wang (2012) proposed an estimator based on SQRT-Lasso (see more details in Section 4), and showed that it attains the minimax optimal statistical rate of convergence in parameter estimation

  ||Θ̂ − Θ*||_2 = O_P(||Θ*||_2 s*√(log d / n)),

where ||Θ*||_2 denotes the spectral norm of Θ* (i.e., the largest singular value of Θ*), and s* denotes the maximum number of nonzero entries in each column of Θ* (i.e., max_l ||Θ*_{*l}||_0 ≤ s*).

Though SQRT-Lasso simplifies the tuning effort and achieves the optimal statistical properties for both sparse linear regression and sparse precision matrix estimation in high dimensions, the optimization problem in (1.3) is computationally more challenging than (1.1) for Lasso, because the ℓ2 loss in SQRT-Lasso does not have the same nice computational structure as the least squares loss in Lasso. For example, the ℓ2 loss can be nondifferentiable, and it does not have a Lipschitz continuous gradient. Belloni et al. (2011) converted (1.3) to a second order cone optimization problem, and further solved it by an interior point method; Li et al. (2015) then solved (1.3) by an ADMM algorithm. Neither of them, however, can scale to large problems. In contrast, Xiao and Zhang (2013) proposed an efficient pathwise iterative shrinkage thresholding algorithm (PISTA) for solving (1.1), which attains a linear convergence to the unique sparse global optimum with high probability.

To address this computational challenge, we propose a pathwise iterative smoothing shrinkage thresholding algorithm (PIS²TA) to solve (1.3). Specifically, we first apply the conjugate dual

(Footnote: The notation O_P is defined in the Notations paragraph at the end of this section.)

smoothing approach to the nonsmooth ℓ2 loss (Nesterov, 2005; Beck and Teboulle, 2012), and obtain a smooth surrogate denoted by ||y − Xθ||_μ, where μ > 0 is a smoothing parameter (see more details in Section 2). We then apply PISTA to solve the partially smoothed optimization problem as follows:

  θ̂ = argmin_{θ ∈ R^d} (1/√n)||y − Xθ||_μ + λ_SQRT ||θ||_1.   (1.4)

Existing computational theory guarantees that our proposed PIS²TA algorithm attains a sublinear convergence to the global optimum in terms of the objective value (Nesterov, 2005). However, our numerical experiments show that PIS²TA achieves far better empirical computational performance (much better than sublinear convergence) for solving SQRT-Lasso: it is significantly more efficient than other competing algorithms, and nearly as efficient as Xiao and Zhang (2013) for solving Lasso. This is because the existing computational analyses of the conjugate dual smoothing approach do not take certain specific modeling structures into consideration. For example: (I) The ℓ2 loss is only nonsmooth when all residuals are equal to zero (a significantly overfitted model). But this is very unlikely to happen, because we are solving (1.3) with a sufficiently large regularization parameter; (II) Although the smoothed ℓ2 loss is not strongly convex, if we restrict the solution to a sparse domain, the smoothed ℓ2 loss can behave like a strongly convex function over a neighborhood of θ*. Motivated by these observations, we establish a new computational theory for PIS²TA, which exploits the above modeling structures. Particularly, we show that PIS²TA achieves a nearly linear convergence (R-linear convergence) to the unique sparse global optimum for solving (1.3) with high probability, and also gives us a well fitted model. There are two implications: (I) We can solve the SQRT-Lasso optimization nearly as efficiently as the Lasso optimization; (II) We pay almost no price in optimization accuracy when using the smoothing approach for solving the SQRT-Lasso optimization, because (1.4) and (1.3) share the same unique sparse global optimum with high probability.

As an extension of our theory for the SQRT-Lasso optimization, we further analyze the computational properties of our proposed PIS²TA algorithm for sparse precision matrix estimation in high dimensions. We show that PIS²TA also achieves an R-linear convergence to the unique sparse global optimum with high probability. We provide numerical experiments on simulated and real data to support our theory. All proofs of our analysis are deferred to the supplementary material.

Notations: Given a vector v = (v_1, ..., v_d)^T ∈ R^d, we define the vector norms ||v||_1 = Σ_j |v_j|, ||v||_2^2 = Σ_j v_j^2, and ||v||_∞ = max_j |v_j|. We denote the number of nonzero entries in v as ||v||_0 = Σ_j 1(v_j ≠ 0). We denote v_{\j} = (v_1, ..., v_{j−1}, v_{j+1}, ..., v_d)^T ∈ R^{d−1} as the subvector of v with the j-th entry removed. Let A ⊆ {1, ..., d} be an index set. We use Ā to denote the complement of A, i.e., Ā = {j : j ∈ {1, ..., d}, j ∉ A}. We use v_A to denote the subvector of v obtained by extracting all entries of v with indices in A. Given a matrix A ∈ R^{d×d}, we use A_{*j} = (A_{1j}, ..., A_{dj})^T to denote the j-th column of A, and A_{k*} = (A_{k1}, ..., A_{kd})^T to denote the k-th row of A. Let Λ_max(A) and Λ_min(A) be the largest and smallest eigenvalues of A. We define ||A||_F^2 = Σ_j ||A_{*j}||_2^2 and ||A||_2^2 = Λ_max(A^T A). We denote A_{\i\j} as the submatrix of A with the i-th row and the j-th column removed, and A_{i\j} as the i-th row of A with its j-th entry removed. Let A ⊆ {1, ..., d}

be an index set. We use A_{AA} to denote the submatrix of A obtained by extracting all entries of A with both row and column indices in A. We write A ≻ 0 if A is a positive-definite matrix. Given two real sequences {A_n}, {a_n}, we write A_n = O(a_n) (or A_n = Ω(a_n)) if and only if there exist M ∈ R_+ and N ∈ N such that |A_n| ≤ M|a_n| (or |A_n| ≥ M|a_n|) for all n ≥ N; A_n ≍ a_n if A_n = O(a_n) and A_n = Ω(a_n) simultaneously; A_n = O_P(a_n) if for every δ ∈ (0, 1) there exist M ∈ R_+ and N_δ ∈ N such that P[|A_n| > M|a_n|] < δ for all n ≥ N_δ; and A_n = o(a_n) if for every δ > 0 there exists N_δ ∈ N such that |A_n| ≤ δ|a_n| for all n ≥ N_δ, i.e., lim_{n→∞} A_n/a_n = 0. Given a vector x ∈ R^d and a real value λ > 0, we denote the soft thresholding operator S_λ(x) = [sign(x_j) max{|x_j| − λ, 0}]_{j=1}^d.

2 Algorithm

Our proposed algorithm consists of three components: (I) Conjugate Dual Smoothing, (II) the Iterative Shrinkage Thresholding Algorithm (ISTA), and (III) Pathwise Optimization.

(I) The Conjugate Dual Smoothing approach is adopted to obtain a smooth surrogate of the ℓ2 loss (Nesterov, 2005; Beck and Teboulle, 2012). We define the smoothed ℓ2 loss function as

  (1/√n)||y − Xθ||_μ = max_{||z||_2 ≤ 1} (1/√n) z^T(y − Xθ) − (μ/2)||z||_2^2.   (2.1)

The optimization problem in (2.1) admits a closed form solution:

  L_μ(θ) = (1/√n)||y − Xθ||_μ = (1/(2μn))||y − Xθ||_2^2 if (1/√n)||y − Xθ||_2 < μ, and (1/√n)||y − Xθ||_2 − μ/2 otherwise.

We present several two-dimensional examples of the smoothed ℓ2 norm using different μ's in Figure 1. A larger μ introduces a larger approximation error, but makes the approximation smoother. We then consider the following partially smoothed optimization problem:

  θ̂ = argmin_{θ ∈ R^d} F_{μ,λ}(θ), where F_{μ,λ}(θ) = L_μ(θ) + λ||θ||_1.   (2.2)

Figure 1: Examples of ||x||_2 (μ = 0) and the smoothed norm ||x||_μ for increasing values of μ (including μ = 0.5), for x ∈ R^2.

(II) The ISTA algorithm is applied to solve (2.2) (Nesterov, 2013). Particularly, given θ^(t) at the t-th iteration, we consider the quadratic approximation of F_{μ,λ}(θ) at θ = θ^(t),

  Q_{μ,λ}(θ, θ^(t)) = L_μ(θ^(t)) + ∇L_μ(θ^(t))^T(θ − θ^(t)) + (L_t/2)||θ − θ^(t)||_2^2 + λ||θ||_1,   (2.3)

where L_t is a step size parameter determined by a backtracking line search. We then take

  θ^(t+1) = argmin_θ Q_{μ,λ}(θ, θ^(t)) = S_{λ/L_t}(θ^(t) − ∇L_μ(θ^(t))/L_t).

For simplicity, we denote θ^(t+1) = T_{L_t,λ}(θ^(t)). Given a pre-specified precision ε, we terminate the iterations when the approximate KKT condition holds:

  ω_λ(θ^(t)) = min_{g ∈ ∂||θ^(t)||_1} ||∇L_μ(θ^(t)) + λg||_∞ ≤ ε.

(III) The Pathwise Optimization component is essentially a multistage optimization scheme for boosting computational performance. We solve (2.2) using a geometrically decreasing sequence of regularization parameters λ_1 > ... > λ_N, where λ_N = λ_SQRT. This yields a sequence of output solutions θ̂^[1], ..., θ̂^[N] from sparse to dense. Particularly, at the K-th optimization stage, we choose θ̂^[K−1] (the output solution of the (K−1)-th stage) as the initial solution, i.e., θ^(0) = θ̂^[K−1], and solve (2.2) with λ = λ_K using the ISTA algorithm. This is also referred to as warm start initialization in the existing literature. We summarize our approach in Algorithm 1.

3 Computational and Statistical Analysis

We first define locally restricted strong convexity and smoothness.

Definition 3.1. Given a constant r ∈ R_+, let B_r = {θ ∈ R^d : ||θ − θ*||_2 ≤ r}. For any v, w ∈ B_r satisfying ||v − w||_0 ≤ s, L_μ is locally restricted strongly convex (LRSC) and smooth (LRSS) on B_r at sparsity level s if there exist universal constants ρ⁻_s, ρ⁺_s ∈ (0, ∞) such that

  (ρ⁻_s/2)||v − w||_2^2 ≤ L_μ(v) − L_μ(w) − ∇L_μ(w)^T(v − w) ≤ (ρ⁺_s/2)||v − w||_2^2.   (3.1)

We define the locally restricted condition number at sparsity level s as κ_s = ρ⁺_s/ρ⁻_s.

The LRSC and LRSS properties are locally constrained variants of restricted strong convexity and smoothness (Agarwal et al., 2010; Xiao and Zhang, 2013) with respect to a neighborhood of the true model parameter θ*, and they are key to establishing the strong convergence guarantees of our proposed algorithm in high dimensions. Next, we introduce two key assumptions for establishing our computational theory.

Assumption 3.2. The sequence of regularization parameters satisfies λ_N ≥ 6||∇L_μ(θ*)||_∞.

Assumption 3.2 requires that λ_N is sufficiently large such that the irrelevant variables can be eliminated (Bickel et al., 2009; Negahban et al., 2012).

Assumption 3.3. L_μ satisfies the LRSC and LRSS properties on B_r, where r ≥ 8√(s*) λ_Ñ / ρ⁻_{s*+s̃} for some Ñ < N, Ñ ∈ Z_+, and λ_Ñ > λ_N. Specifically, (3.1) holds with ρ⁺_{s*+s̃}, ρ⁻_{s*+s̃} ∈ (0, ∞), where s̃ = C_1 s* > (96κ²_{s*+s̃} + 44κ_{s*+s̃}) s*, C_1 ∈ R_+ is a constant, and κ_{s*+s̃} = ρ⁺_{s*+s̃}/ρ⁻_{s*+s̃}.

Assumption 3.3 guarantees that L_μ satisfies the LRSC and LRSS properties as long as the estimation error satisfies ||θ − θ*||_2 ≤ r and the number of irrelevant coordinates of the solution is bounded by s̃.
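Before turning to the full pathwise procedure (Algorithm 1 below), the building blocks of Section 2 translate directly into a few lines of code. The following is a minimal sketch in Python/NumPy (the paper's experiments are implemented in C, and all function names here are ours) of the smoothed loss L_μ, its gradient, the soft-thresholding operator S_λ, the proximal update T_{L,λ}, and the approximate KKT residual ω_λ, under the scaling conventions of (2.1)–(2.3).

```python
import numpy as np

def smoothed_loss(theta, X, y, mu):
    """L_mu(theta): conjugate-dual smoothing of the l2 loss (1/sqrt(n))||y - X theta||_2."""
    n = X.shape[0]
    r = (y - X @ theta) / np.sqrt(n)            # scaled residual
    nrm = np.linalg.norm(r)
    if nrm < mu:                                # quadratic (smooth) region
        return nrm ** 2 / (2.0 * mu)
    return nrm - mu / 2.0                       # linear region

def smoothed_grad(theta, X, y, mu):
    """Gradient of L_mu: -X^T r / (sqrt(n) * max(mu, ||r||_2)), with r = (y - X theta)/sqrt(n)."""
    n = X.shape[0]
    r = (y - X @ theta) / np.sqrt(n)
    return -X.T @ r / (np.sqrt(n) * max(mu, np.linalg.norm(r)))

def soft_threshold(x, lam):
    """S_lam(x): entrywise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_step(theta, X, y, mu, lam, L):
    """T_{L,lam}(theta): one ISTA update on the smoothed objective with step size 1/L."""
    return soft_threshold(theta - smoothed_grad(theta, X, y, mu) / L, lam / L)

def kkt_residual(theta, X, y, mu, lam):
    """omega_lam(theta): min over subgradients g of ||grad L_mu(theta) + lam * g||_inf."""
    g = smoothed_grad(theta, X, y, mu)
    res = np.where(theta != 0,
                   g + lam * np.sign(theta),                      # active coordinates
                   np.maximum(np.abs(g) - lam, 0.0))              # zero coordinates
    return np.max(np.abs(res))
```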

Algorithm 1: Pathwise Iterative Smoothing Shrinkage Thresholding Algorithm (PIS²TA) for solving the SQRT-Lasso optimization (1.4). θ̂^[K] denotes the output solution corresponding to λ_K; θ^(t) denotes the solution at the t-th iteration of the K-th optimization stage; ε_K is a pre-specified precision for the K-th optimization stage. The line search procedure is described in Algorithm 2.

Input: y, X, N, λ_N, ε_N, L_max > 0
Initialize: θ̂^[0] ← 0, λ_0 ← ||∇L_μ(0)||_∞, η ← (λ_N/λ_0)^{1/N}
For K = 1, ..., N:
  t ← 0, λ_K ← ηλ_{K−1}, θ^(0) ← θ̂^[K−1], L_0 ← L_max
  Repeat:
    t ← t + 1
    L̃_t ← LineSearch(λ_K, θ^(t−1), L_{t−1}), L_t ← min{L̃_t, L_max}
    θ^(t) ← T_{L_t, λ_K}(θ^(t−1))
  Until: ω_{λ_K}(θ^(t)) ≤ ε_K
  θ̂^[K] ← θ^(t)
End For
Return: θ̂^[N]

Algorithm 2: Line search of PIS²TA for SQRT-Lasso.

Input: λ_K, θ^(t), L_t
Initialize: L̃_t ← L_t
Repeat:
  θ̃^(t) ← T_{L̃_t, λ_K}(θ^(t))
  If F_{μ,λ_K}(θ̃^(t)) < Q_{μ,λ_K}(θ̃^(t), θ^(t)): L̃_t ← L̃_t/2 End If
Until: F_{μ,λ_K}(θ̃^(t)) ≥ Q_{μ,λ_K}(θ̃^(t), θ^(t))
Return: L̃_t

3.1 Computational Theory

Our analysis consists of two phases, depending on the estimation error and the sparsity of the solution θ̂^[K] along the path. Specifically, we denote B_r^{s*+s̃} = B_r ∩ {θ ∈ R^d : ||θ − θ*||_0 ≤ s* + s̃}. Let Ñ ∈ {1, ..., N} be a cut-off between Phase I and Phase II. We can show that Phase I corresponds to the first Ñ stages of pathwise optimization, in which we cannot guarantee θ̂^[K] ∈ B_r^{s*+s̃}. Thus we only establish a sublinear convergence for Phase I. But Phase I is still computationally efficient,

since we can choose reasonably large ε_K for K = 1, ..., Ñ to facilitate early stopping. Phase II corresponds to the subsequent N − Ñ stages, in which we guarantee θ̂^[K] ∈ B_r^{s*+s̃}. Thus LRSC and LRSS hold, and a linear convergence can be established accordingly.

Theorem 3.4. Suppose Assumptions 3.2 and 3.3 hold. Let θ̂^[K] be the output solution satisfying ω_{λ_K}(θ̂^[K]) ≤ ε_K at the K-th stage, for all K = 1, ..., N. We denote S = {j : θ*_j ≠ 0}, S̄ = {j : θ*_j = 0}, and s* = |S|. Recall that η is the decaying ratio of the geometrically decreasing regularization sequence. Given μ ≤ σ/4, λ_0 = ||∇L_μ(0)||_∞, and η = (λ_N/λ_0)^{1/N} ∈ (5/6, 1), there exists Ñ ∈ {1, 2, ..., N} such that the following results hold.

Phase I: Let R = max_{1 ≤ K ≤ Ñ} {sup_θ ||θ − θ*||_2 : F_{μ,λ_K}(θ) ≤ F_{μ,λ_K}(θ^(0))}. At the K-th stage, where K = 1, ..., Ñ, we need at most

  T_K = O(||X||_2^4 R / (ε_K μ^2))

iterations to guarantee ω_{λ_K}(θ^(t)) ≤ ε_K and F_{μ,λ_K}(θ̂^[K]) − F_{μ,λ_K}(θ̄^[K]) = O(||X||_2^2 R^2 / (T_K μ)), where θ̄^[K] is a global optimum of (1.4) with λ = λ_K. Moreover, we have θ̂^[Ñ] ∈ B_r and ||[θ̂^[Ñ]]_{S̄}||_0 ≤ s̃.

Phase II: Let α = 1 − 1/(8κ_{s*+s̃}). At the K-th stage, where K = Ñ+1, ..., N, the solutions are sparse throughout all iterations, i.e., ||[θ^(t)]_{S̄}||_0 ≤ s̃. Moreover, we need at most T_K = O(κ_{s*+s̃} log(κ³_{s*+s̃} s* λ_K / ε_K)) iterations to guarantee ω_{λ_K}(θ^(t)) ≤ ε_K,

  F_{μ,λ_K}(θ̂^[K]) − F_{μ,λ_K}(θ̄^[K]) = O(α^{T_K} ε_K λ_K s*), and ||θ̂^[K] − θ̄^[K]||_2^2 = O(α^{T_K} ε_K λ_K s*),

where θ̄^[K] is the unique sparse global optimum of (1.4) with λ = λ_K, satisfying ||[θ̄^[K]]_{S̄}||_0 ≤ s̃.

Theorem 3.4 guarantees that PIS²TA achieves an R-linear convergence to the unique sparse global optimum of (1.4) in terms of both the objective value and the solution parameter, which is nearly as efficient as Xiao and Zhang (2013) for solving Lasso, with much less tuning effort, since λ_N is independent of σ. This further explains why PIS²TA is much more efficient than other competing algorithms for solving SQRT-Lasso, such as ADMM and SOCP with an interior point method. A geometric interpretation of Theorem 3.4 is provided in Figure 2. The first Ñ stages serve as intermediate processes that facilitate fast convergence to θ̂^[Ñ] and do not require high precision solutions. Thus we choose ε_K = λ_K/4 ≥ ε_N for K = 1, ..., N−1 so that Phase I is efficient, as shown in Figure 3, and require high precision only for the last stage, e.g., ε_N = 10^{-5}, because only the last regularization parameter is of interest. The total number of iterations for computing the entire solution path is at most

  O(Ñ ||X||_2^4 R / (λ_N μ^2) + κ_{s*+s̃} (N − Ñ) log(κ_{s*+s̃} s* λ_N / ε_N)).

Moreover, given properly chosen μ and λ_N, we guarantee that none of the linear convergence region, the true model parameter θ*, and the output solution θ̂^[N] falls into the smooth region. This implies that (1.3) and (1.4) share the same global optimum in Phase II. A formal claim is presented in Section 3.2.
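To make the pathwise scheme analyzed above concrete, the sketch below strings the helpers from the earlier listing into the warm-started loop of Algorithm 1, using a geometric sequence of regularization parameters, loose stage precisions ε_K = λ_K/4, and a standard doubling backtracking line search standing in for Algorithm 2. It is an illustrative sketch, not the authors' implementation, and the default parameter values are assumptions.

```python
import numpy as np

def pis2ta_path(X, y, lam_target, N=50, mu=1e-4, eps_target=1e-5, L_max=1e6, max_iter=1000):
    """Pathwise iterative smoothing shrinkage thresholding (a sketch of Algorithm 1).

    Solves min_theta L_mu(theta) + lam * ||theta||_1 along a geometric sequence
    lam_0 > ... > lam_N = lam_target, warm-starting each stage at the previous solution.
    """
    n, d = X.shape
    theta = np.zeros(d)
    lam0 = np.max(np.abs(smoothed_grad(theta, X, y, mu)))   # lam_0 = ||grad L_mu(0)||_inf
    eta = (lam_target / lam0) ** (1.0 / N)                  # decay ratio of the lambda sequence
    lam, L = lam0, 1.0
    for K in range(1, N + 1):
        lam *= eta
        eps_K = eps_target if K == N else lam / 4.0         # loose precision except at the last stage
        for _ in range(max_iter):
            L = max(L / 2.0, 1e-8)                          # try a more aggressive step first
            f_old = smoothed_loss(theta, X, y, mu)
            g = smoothed_grad(theta, X, y, mu)
            while True:                                     # backtracking: double L until the model majorizes
                theta_new = soft_threshold(theta - g / L, lam / L)
                diff = theta_new - theta
                quad = f_old + g @ diff + 0.5 * L * diff @ diff
                if smoothed_loss(theta_new, X, y, mu) <= quad or L >= L_max:
                    break
                L *= 2.0
            theta = theta_new
            if kkt_residual(theta, X, y, mu, lam) <= eps_K:  # approximate KKT stopping rule
                break
    return theta
```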

Figure 2: A geometric interpretation of the two phases of convergence, showing the initial solution θ̂^[0] = 0, the Phase I output solutions θ̂^[1], ..., θ̂^[Ñ], the Phase II output solutions θ̂^[Ñ+1], ..., θ̂^[N], the neighborhood B_r^{s*+s̃} of θ*, the smoothed (overfitted model) region, the region of sublinear convergence (Phase I, yellow), and the region of linear convergence (Phase II, orange). The linear convergence region, the true parameter θ*, and the output solution θ̂^[N] do not fall into the smoothed region.

Figure 3: Plots of the objective gaps F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K]) over the iterations t of each path following stage K = 1, ..., N (only a few stages are shown for clarity).

3.2 Statistical Theory

To analyze the statistical properties of the estimator obtained via PIS²TA, we assume that the design matrix X satisfies the following restricted eigenvalue condition.

Assumption 3.5. The design matrix X satisfies the Restricted Eigenvalue (RE) condition, i.e., there exist constants ψ_min, ψ_max, φ_min, φ_max ∈ (0, ∞), which do not scale with (s*, n, d), such that

  ψ_min ||v||_2^2 − φ_min (log d / n) ||v||_1^2 ≤ (1/n)||Xv||_2^2 ≤ ψ_max ||v||_2^2 + φ_max (log d / n) ||v||_1^2.   (3.2)

A wide family of examples satisfies the RE condition, such as the correlated sub-Gaussian random design (Rudelson and Zhou, 2013), which has been extensively studied for sparse recovery (Candes and Tao, 2005; Bickel et al., 2009; Raskutti et al., 2010). We next verify Assumptions 3.2 and 3.3 based on the RE condition in the following lemma.

Lemma 3.6. Suppose Assumption 3.5 holds. Given μ ≤ σ/4 and λ_N = 4√(log d / n), we have λ_N ≥ 6||∇L_μ(θ*)||_∞ with high probability. Moreover, for large enough n, L_μ(θ) satisfies the LRSC and LRSS properties on B_r with high probability, where r = σ/(8√ψ_max). Specifically, (3.1) holds with ρ⁺_{s*+s̃} ≤ 8ψ_max/σ and ρ⁻_{s*+s̃} ≥ ψ_min/(8σ), where s̃ = C_2 s* > (96κ²_{s*+s̃} + 44κ_{s*+s̃}) s*, C_2 ∈ R_+ is a generic constant, and κ_{s*+s̃} ≤ 64ψ_max/ψ_min.

Lemma 3.6 guarantees that Assumption 3.2 holds given properly chosen μ and λ_N, and that Assumption 3.3 holds given a design X satisfying the RE condition, both with high probability. Therefore, by Theorem 3.4, PIS²TA achieves an R-linear convergence to the unique sparse global optimum. In the next theorem, we characterize the statistical rate of convergence of PIS²TA.

Theorem 3.7. Suppose Assumption 3.5 holds. Let the output solution θ̂^[N] satisfy ω_{λ_N}(θ̂^[N]) ≤ ε_N for a small enough ε_N. Given μ ≤ σ/4, λ_N = 4√(log d / n), and a large enough n, we have

  ||θ̂^[N] − θ*||_2 = O_P(σ√(s* log d / n)) and ||θ̂^[N] − θ*||_1 = O_P(σ s* √(log d / n)).

Moreover, let σ̂ = (1/√n)||y − Xθ̂^[N]||_2 be the estimator of σ. Then we have |σ̂ − σ| = O_P(σ s* log d / n).

Theorem 3.7 guarantees that the output solution θ̂^[N] obtained by PIS²TA achieves the minimax optimal rate of convergence in parameter estimation (Ye and Zhang, 2010; Raskutti et al., 2011). The next proposition shows that (1.3) and (1.4) share the same global optimum, which corresponds to a well fitted model. This implies that neither the unique sparse global optimum θ̄^[N] nor the linear convergence region falls into the smooth region, as shown in Figure 2.

Proposition 3.8. Under the same assumptions as Theorem 3.7, for all λ_K's with K = Ñ+1, ..., N, (1.3) and (1.4) share the same unique sparse global optimum with high probability.

4 Extension to Sparse Precision Matrix Estimation

We consider the TIGER approach proposed in Liu and Wang (2012) for estimating a sparse precision matrix. To be clear, we emphasize that in this section we use v_1, v_2, ... (without brackets) as subscripts to index vectors. Let X = [x_1, ..., x_n]^T ∈ R^{n×d} contain n observed data points from a d-variate Gaussian distribution N_d(0, Σ*). Our goal is to estimate the sparse precision matrix Θ* = (Σ*)^{-1}. Let Z = XΓ̂^{-1/2} = [z_1, ..., z_d] ∈ R^{n×d} be the standardized data matrix, where Γ̂ = diag(Σ̂_{11}, ..., Σ̂_{dd}) is a diagonal matrix and Σ̂ = (1/n)X^T X. Then we can write

  z_i = Z_{*,\i} θ*_i + Γ̂_{ii}^{-1/2} ε_i for all i = 1, ..., d, where θ*_i = Γ̂_{ii}^{-1/2} Γ̂_{\i,\i}^{1/2} (Σ*_{\i,\i})^{-1} Σ*_{\i,i},

and ε_i ~ N_n(0, σ_i^2 I) with σ_i^2 = Σ*_{ii} − Σ*_{i,\i}(Σ*_{\i,\i})^{-1}Σ*_{\i,i}. We denote τ_i^2 = σ_i^2/Σ*_{ii}, and solve

  θ̂_i = argmin_{θ_i ∈ R^{d−1}} L_{μ,i}(θ_i) + λ||θ_i||_1 and τ̂_i = (1/√n)||z_i − Z_{*,\i}θ̂_i||_μ,   (4.1)

for all i = 1, ..., d, where L_{μ,i}(θ_i) = (1/√n)||z_i − Z_{*,\i}θ_i||_μ. The i-th column of the precision matrix is then estimated by

  Θ̂_{ii} = τ̂_i^{-2} Γ̂_{ii}^{-1} and Θ̂_{\i,i} = −τ̂_i^{-2} Γ̂_{ii}^{-1/2} Γ̂_{\i,\i}^{-1/2} θ̂_i.

Here we solve (4.1) by PIS²TA. We then introduce a few mild technical assumptions.

Assumption 4.1. Suppose the true covariance matrix Σ* and precision matrix Θ* satisfy:
(A1) Θ* ∈ M(κ_Θ, s*) = {Θ ∈ R^{d×d} : Θ ≻ 0, Λ_max(Θ)/Λ_min(Θ) ≤ κ_Θ, max_i Σ_{j≠i} 1(Θ_{ij} ≠ 0) ≤ s*};
(A2) s* log d = o(n);
(A3) lim sup_{n→∞} max_i Σ*_{ii} √(log d / n) < 1/4.

We first verify the assumptions required by our computational theory in the following lemma.

Lemma 4.2. Suppose Assumption 4.1 holds. Given μ ≤ 1/(5√(5κ_Θ)) and λ_N = 6√(log d / n), we have λ_N ≥ 6 max_{i ∈ {1,...,d}} ||∇L_{μ,i}(θ*_i)||_∞ with high probability. Moreover, L_{μ,i}(θ_i) satisfies the LRSC and LRSS properties on B_{r_i} for all i = 1, ..., d with high probability, where r_i ≍ σ_i/√κ_Θ. Specifically, for all i = 1, ..., d, (3.1) holds with ρ⁺_{s*+s̃} = O(κ_Θ/σ_i) and ρ⁻_{s*+s̃} = Ω(1/(κ_Θ σ_i)), where s̃ = C_3 s* > (96κ²_{s*+s̃} + 44κ_{s*+s̃}) s* for a generic constant C_3, and κ_{s*+s̃} ≤ 44κ_Θ².

Lemma 4.2 guarantees that Assumptions 3.2 and 3.3 hold with high probability given properly chosen μ and λ_N. Thus, by Theorem 3.4, PIS²TA achieves an R-linear convergence to the unique sparse global optimum of (4.1) with high probability for every column of Θ*. The next theorem characterizes the statistical rate of convergence of the precision matrix estimator obtained by PIS²TA.

Theorem 4.3. Suppose Assumption 4.1 holds. Let Θ̂^[N] be the output solution for the regularization parameter λ_N. Given μ and λ_N = 6√(log d / n) as in Lemma 4.2, we have ||Θ̂^[N] − Θ*||_2 = O_P(s* ||Θ*||_2 √(log d / n)).

Theorem 4.3 implies that our precision matrix estimator attains the minimax optimal rate of convergence in parameter estimation. Moreover, we guarantee that neither the linear convergence region nor the output solution Θ̂^[N] falls into the smooth region with high probability.

5 Numerical Experiments

We investigate the computational performance of the proposed PIS²TA algorithm through numerical experiments on both simulated and real data examples. All simulations are implemented in C with double precision using a PC with an Intel 3.3GHz Core i5 CPU and 6GB memory. For simulated data, we generate a training dataset of 200 samples, where each row X_{i*}, i = 1, ..., 200, of the design matrix is sampled independently from a 2000-dimensional normal distribution N(0, Σ) with Σ_{jj} = 1 and Σ_{jk} = 0.5 for all k ≠ j. We set s* = 3 with nonzero entries θ*_1 = 3, θ*_2, and θ*_4 = 1.5, and θ*_j = 0 for all j ∉ {1, 2, 4}. A validation set of 200 samples for the regularization parameter selection and a testing set of 10,000 samples are also generated to evaluate the prediction accuracy.
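A sketch of this simulated-data setup is given below, assuming the equi-correlated covariance Σ_jj = 1, Σ_jk = 0.5 described above; the sample sizes and the nonzero coefficient values in the code are placeholders in the spirit of the description, not necessarily the exact values used in the experiments.

```python
import numpy as np

def make_data(n=200, d=2000, rho=0.5, sigma=1.0, seed=0):
    """Equi-correlated Gaussian design: Sigma_jj = 1, Sigma_jk = rho for j != k.

    A draw from N(0, Sigma) with this Sigma can be generated without forming the d x d
    covariance: x = sqrt(rho) * z0 + sqrt(1 - rho) * z, with a shared z0 ~ N(0, 1) and
    independent z ~ N(0, I_d).
    """
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal((n, 1))
    z = rng.standard_normal((n, d))
    X = np.sqrt(rho) * z0 + np.sqrt(1.0 - rho) * z
    theta_star = np.zeros(d)
    theta_star[[0, 1, 3]] = [3.0, 2.0, 1.5]      # placeholder values for theta*_1, theta*_2, theta*_4
    y = X @ theta_star + sigma * rng.standard_normal(n)
    return X, y, theta_star
```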

We set σ = 0.5, 1, 2, and 4 respectively to illustrate the tuning insensitivity. The regularization parameter of both Lasso and SQRT-Lasso is chosen over a geometrically decreasing sequence {λ_K}_{K=0}^{50}, with λ_50 ≍ σ√(log d / n) for Lasso and λ_50 ≍ √(log d / n) for SQRT-Lasso. The optimal regularization parameter is determined by λ_opt = λ_N̂ with N̂ = argmin_{K ∈ {0,...,50}} ||ỹ − X̃θ̂^[K]||_2, where θ̂^[K] denotes the estimate obtained using the regularization parameter λ_K, and ỹ and X̃ denote the response vector and design matrix of the validation set. For both Lasso and SQRT-Lasso, we set the stopping precision ε_K = 10^{-5} for all K = 1, ..., 50. For SQRT-Lasso, we set the smoothing parameter μ = 10^{-4}.

First of all, we compare PIS²TA with the ADMM algorithm proposed in Li et al. (2015). The backtracking line search described in Algorithm 2 is adopted to accelerate both algorithms. We conduct 500 simulations for each value of σ. The results are presented in Table 1. The PIS²TA and ADMM algorithms attain similar objective values, but PIS²TA is about 10 times faster than ADMM. Both algorithms also achieve similar estimation errors. Throughout all 500 simulations, the residual satisfies (1/√n)||y − Xθ̂^[N̂]||_2 > μ. This implies that all obtained optimal estimators are outside the smoothed region of the optimization problem, i.e., the smoothing approach does not hurt the solution accuracy.

Next, we compare the computational and statistical performance of Lasso solved by PISTA (Xiao and Zhang, 2013) and SQRT-Lasso solved by PIS²TA. The results averaged over 500 simulations are summarized in Table 1. In terms of statistical performance, Lasso and SQRT-Lasso attain similar estimation and prediction errors. In terms of computational performance, the PIS²TA algorithm for solving SQRT-Lasso is as efficient as PISTA for Lasso, which matches our computational analysis. Moreover, we also examine the optimal regularization parameters for Lasso and SQRT-Lasso. We visualize the distribution of all 500 selected λ_opt's using a kernel density estimator. Particularly, we adopt the Gaussian kernel, and the kernel bandwidth is selected based on 10-fold cross validation. Figure 4 illustrates the estimated density functions. The horizontal axis corresponds to the rescaled regularization parameter λ_opt/√(log d / n). We see that the optimal regularization parameters of Lasso vary significantly with different σ. In contrast, the optimal regularization parameters of SQRT-Lasso are much more concentrated. This is consistent with the claimed tuning insensitivity.

Finally, we compare PIS²TA with ADMM on real data sets for precision matrix estimation. Particularly, we use four real-world biology data sets preprocessed by Li and Toh (2010): Estrogen (d = 692), Arabidopsis (d = 834), Leukemia (d = 1,255), and Hereditary (d = 1,869). We set three different values for λ_N such that the obtained estimators achieve different levels of sparse recovery. We set N = 50 and ε_K = 10^{-4} for all K's. The timing performance is summarized in Table 2. As can be seen, PIS²TA is 5 to 10 times faster than ADMM on all four data sets.

Footnote 2: We do not have any results for the algorithm proposed in Belloni et al. (2011), because it failed to finish the 500 simulations after hours of computation. The implementation was based on SDPT3.
Footnote 3: We do not have any results for the algorithm proposed in Belloni et al. (2011), because it failed to finish the experiments on all four data sets after hours of computation. The implementation was also based on SDPT3.
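For the precision matrix experiments, each column of Θ̂ is obtained from one smoothed SQRT-Lasso regression and rescaled as in Section 4. The sketch below assembles the estimator column by column, reusing the pathwise solver from the earlier sketch; it follows our reading of (4.1), uses the unsmoothed residual norm for τ̂_i, and symmetrizes at the end, so the constants and conventions may differ from the authors' code.

```python
import numpy as np

def tiger_precision(X, lam, mu=1e-4):
    """Column-wise sparse precision matrix estimation in the spirit of Section 4 (TIGER).

    Each column i is obtained by regressing the standardized i-th variable on the others
    with the smoothed SQRT-Lasso solver, then rescaling by the estimated residual scale.
    """
    n, d = X.shape
    Sigma_hat = X.T @ X / n
    Gamma = np.diag(Sigma_hat)                       # marginal variances Sigma_hat_ii
    Z = X / np.sqrt(Gamma)                           # standardized data matrix
    Theta = np.zeros((d, d))
    for i in range(d):
        idx = np.delete(np.arange(d), i)
        theta_i = pis2ta_path(Z[:, idx], Z[:, i], lam, mu=mu)
        resid = Z[:, i] - Z[:, idx] @ theta_i
        tau2 = max(np.linalg.norm(resid) ** 2 / n, 1e-12)   # tau_i^2, kept away from zero
        Theta[i, i] = 1.0 / (tau2 * Gamma[i])
        Theta[idx, i] = -theta_i / (tau2 * np.sqrt(Gamma[i]) * np.sqrt(Gamma[idx]))
    return (Theta + Theta.T) / 2.0                   # symmetrize the column-wise estimates
```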

Table 1: Quantitative comparison between Lasso (solved by PISTA) and SQRT-Lasso (solved by PIS²TA and ADMM) over 500 simulations. The estimation error is defined as ||θ̂ − θ*||_2. The prediction error is defined as ||ỹ − X̃θ̂^[N̂]||_2/√n. The residual is defined as (1/√n)||y − Xθ̂^[N̂]||_2 (for PIS²TA only). PIS²TA attains nearly the same estimation and prediction errors as ADMM, but is significantly faster than ADMM over all settings. Besides, all obtained estimators are outside the smoothed region throughout all 500 simulations. (Columns: variance of noise; estimation error for Lasso and PIS²TA; prediction error for Lasso and PIS²TA; time in seconds for Lasso, PIS²TA, and ADMM; residual for PIS²TA. Rows: σ = 0.5, 1, 2, 4. Numerical entries omitted.)

Table 2: Timing comparison between PIS²TA and ADMM on the biology data under different levels of sparsity recovery. PIS²TA is significantly faster than ADMM over all settings and data sets. (Columns: Estrogen, Arabidopsis, Leukemia, and Hereditary, each with PIS²TA and ADMM timings. Rows: three sparsity levels, including 3% and 10%. Numerical entries omitted.)

Figure 4: Estimated distributions of λ_opt/√(log d / n) over different values of σ (σ = 0.5, 1, 2, 4) for (a) Lasso and (b) SQRT-Lasso (PIS²TA).

References

Agarwal, A., Negahban, S. and Wainwright, M. J. (2010). Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems.

Beck, A. and Teboulle, M. (2012). Smoothing and first order methods: A unified framework. SIAM Journal on Optimization.

Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika.

Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.

Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory.

Johnstone, I. M. (2001). Chi-square oracle inequalities. Lecture Notes–Monograph Series.

Li, L. and Toh, K.-C. (2010). An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation.

Li, X., Zhao, T., Yuan, X. and Liu, H. (2015). The flare package for high dimensional linear regression and precision matrix estimation in R. The Journal of Machine Learning Research.

Liu, H. and Wang, L. (2012). TIGER: A tuning-insensitive approach for optimally estimating Gaussian graphical models. Technical report, Massachusetts Institute of Technology.

Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming.

Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming.

Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research.

Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory.

Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B.

Wainwright, M. J. (2015). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. In preparation, University of California, Berkeley.

Xiao, L. and Zhang, T. (2013). A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization.

Ye, F. and Zhang, C.-H. (2010). Rate minimaxity of the lasso and Dantzig selector for the ℓq loss in ℓr balls. The Journal of Machine Learning Research.

Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics.

Zhang, T. (2009). Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics.

A Intermediate Results for Theorem 3.4 and Theorem 3.7

We first introduce some important implications of the proposed assumptions. Recall that S = {j : θ*_j ≠ 0} is the index set of nonzero entries of θ* with s* = |S|, and S̄ = {j : θ*_j = 0} is its complement. From Lemma 3.6, Assumption 3.3 implies RSC and RSS with parameters ρ⁻_{s*+s̃} and ρ⁺_{s*+s̃} respectively. By Nesterov (2004), the following conditions are equivalent to RSC and RSS: for any v, w ∈ R^d satisfying ||v − w||_0 ≤ s* + s̃,

  (ρ⁻_{s*+s̃}/2)||v − w||_2^2 ≤ L_μ(v) − L_μ(w) − ∇L_μ(w)^T(v − w) ≤ (ρ⁺_{s*+s̃}/2)||v − w||_2^2,   (A.1)
  ρ⁻_{s*+s̃}||v − w||_2^2 ≤ (∇L_μ(v) − ∇L_μ(w))^T(v − w) ≤ ρ⁺_{s*+s̃}||v − w||_2^2.   (A.2)

From the convexity of the ℓ1 norm, we have

  ||v||_1 − ||w||_1 ≥ (v − w)^T g,   (A.3)

where g ∈ ∂||w||_1. Combining (A.1) and (A.3), we have, for any v, w ∈ R^d satisfying ||v − w||_0 ≤ s* + s̃,

  F_{μ,λ}(v) − F_{μ,λ}(w) − (v − w)^T ξ ≥ (ρ⁻_{s*+s̃}/2)||v − w||_2^2, where ξ ∈ ∂F_{μ,λ}(w).   (A.4)

Remark A.1. For any t and K, the line search satisfies L_t ≤ L̃_t ≤ min{2L_μ, L_max} and, restricted to sparse iterates, L_t ≤ L̃_t ≤ 2ρ⁺_{s*+s̃},   (A.5)
where L_μ = min{L : ||∇L_μ(v) − ∇L_μ(w)||_2 ≤ L||v − w||_2 for all v, w ∈ R^d} is the Lipschitz constant of ∇L_μ.

We first show that when θ is sparse and the approximate KKT condition is satisfied, both the ℓ1 estimation error and the objective error with respect to the true model parameter are bounded. This characterizes the fact that the initial value θ^(0) of the K-th path following stage has desirable statistical properties if we initialize θ^(0) = θ̂^[K−1]. This is formalized in Lemma A.2, and its proof is deferred to Appendix N.1.

Lemma A.2. Suppose Assumptions 3.2 and 3.3 hold, and λ ≥ λ_N. If θ satisfies ||θ_{S̄}||_0 ≤ s̃ and the approximate KKT condition

  min_{g ∈ ∂||θ||_1} ||∇L_μ(θ) + λg||_∞ ≤ λ/2,   (A.6)

then we have

  ||(θ − θ*)_{S̄}||_1 ≤ 5||(θ − θ*)_S||_1,   (A.7)
  ||θ − θ*||_2 ≤ λ√(s*)/ρ⁻_{s*+s̃},   (A.8)
  ||θ − θ*||_1 ≤ λ s*/ρ⁻_{s*+s̃},   (A.9)
  F_{μ,λ}(θ) − F_{μ,λ}(θ*) ≤ 6λ²s*/ρ⁻_{s*+s̃}.   (A.10)

Next, we show that if θ is sparse and the objective error is bounded, then the estimation error is also bounded. This characterizes the fact that, within the K-th path following stage, good statistical performance is preserved after each proximal-gradient update. This is formalized in Lemma A.3, and its proof is deferred to Appendix N.2.

Lemma A.3. Suppose Assumptions 3.2 and 3.3 hold, and λ ≥ λ_N. If θ satisfies ||θ_{S̄}||_0 ≤ s̃ and the objective satisfies F_{μ,λ}(θ) − F_{μ,λ}(θ*) ≤ 6λ²s*/ρ⁻_{s*+s̃}, then we have

  ||θ − θ*||_2 ≤ 4λ√(3s*)/ρ⁻_{s*+s̃},   (A.11)
  ||θ − θ*||_1 ≤ 4λ s*/ρ⁻_{s*+s̃}.   (A.12)

We then show that if θ is sparse and the objective error is bounded, then the proximal-gradient update keeps the solution sparse. This demonstrates that, within the K-th path following stage, each update θ^(t) remains sparse and retains good statistical performance. This is formalized in Lemma A.4, and its proof is deferred to Appendix N.3.

Lemma A.4. Suppose Assumptions 3.2 and 3.3 hold, and λ ≥ λ_N. If θ satisfies ||θ_{S̄}||_0 ≤ s̃, L satisfies L < ρ⁺_{s*+s̃}, and the objective satisfies F_{μ,λ}(θ) − F_{μ,λ}(θ*) ≤ 6λ²s*/ρ⁻_{s*+s̃}, then we have

  ||[T_{L,λ}(θ)]_{S̄}||_0 ≤ s̃.   (A.13)

Moreover, we show that if θ satisfies the approximate KKT condition, then the objective error is bounded in terms of the regularization parameter λ. This characterizes the geometric decrease of the objective error when we choose a geometrically decreasing sequence of regularization parameters. This is formalized in Lemma A.5, and its proof is deferred to Appendix N.4.

Lemma A.5. Suppose Assumptions 3.2 and 3.3 hold, and λ ≥ λ_N. Suppose θ satisfies ω_λ(θ) ≤ λ/2. For any λ̄ ∈ [λ_N, λ], let θ̄ = argmin_θ F_{μ,λ̄}(θ). Then we have

  F_{μ,λ̄}(θ) − F_{μ,λ̄}(θ̄) ≤ (λ + λ̄)(ω_λ(θ) + λ + λ̄) s*/ρ⁻_{s*+s̃}.

Furthermore, we show that each path following stage enjoys a local linear convergence rate if the initial value θ^(0) is sparse and satisfies the approximate KKT condition with adequate precision; moreover, the estimate after each proximal gradient update remains sparse. This is the key result in demonstrating the overall geometric convergence rate of the algorithm. This is formalized in Lemma A.6, and its proof is deferred to Appendix N.5.

Lemma A.6. Suppose Assumption 3.3 holds. If the initialization θ^(0) of a stage with regularization parameter λ in Algorithm 1 satisfies ||θ^(0)_{S̄}||_0 ≤ s̃, then for any t = 1, 2, ..., we have ||θ^(t)_{S̄}||_0 ≤ s̃ and

  F_{μ,λ}(θ^(t)) − F_{μ,λ}(θ̄) ≤ (1 − 1/(8κ_{s*+s̃}))^t (F_{μ,λ}(θ^(0)) − F_{μ,λ}(θ̄)),

where θ̄ = argmin_θ F_{μ,λ}(θ).

In addition, we provide a sublinear convergence rate when RSC does not hold, based on a refined analysis of the convergence rate of the proximal gradient method with line search for convex (but not strongly convex) objectives (Nesterov, 2013). Specifically, we provide a sublinear rate of convergence without the need to quantify the distance of the initial objective to the optimal objective. This characterizes the convergence behavior when ||X(θ − θ*)||_2 is large. We formalize this in Lemma A.7, and provide the proof in Appendix N.6.

Lemma A.7 (Refined result of Theorem 4 in Nesterov (2013)). Given the initialization θ^(0), let R be such that ||θ − θ̄||_2 ≤ R for any θ ∈ R^d satisfying F_{μ,λ}(θ) ≤ F_{μ,λ}(θ^(0)). Then for any t = 1, 2, ..., we have

  F_{μ,λ}(θ^(t)) − F_{μ,λ}(θ̄) ≤ 4||X||_2^2 R^2/((t + 1)μ),   (A.14)

where θ̄ = argmin_θ F_{μ,λ}(θ).

Finally, we introduce two results characterizing the proximal gradient mapping, adapted from Nesterov (2013) and Xiao and Zhang (2013) without proof. The first lemma describes the sufficient descent of the objective achieved by the proximal gradient method.

Lemma A.8 (Adapted from Theorem 1 in Nesterov (2013)). For any L > 0,

  Q_{μ,λ}(T_{L,λ}(θ), θ) ≤ F_{μ,λ}(θ) − (L/2)||T_{L,λ}(θ) − θ||_2^2.

Besides, if L_μ(θ) is convex, we have

  Q_{μ,λ}(T_{L,λ}(θ), θ) ≤ min_x {F_{μ,λ}(x) + (L/2)||x − θ||_2^2}.   (A.15)

Further, for any L ≥ L_μ, we have

  F_{μ,λ}(T_{L,λ}(θ)) ≤ Q_{μ,λ}(T_{L,λ}(θ), θ) ≤ F_{μ,λ}(θ) − (L/2)||T_{L,λ}(θ) − θ||_2^2.   (A.16)

The next lemma provides an upper bound on the optimal residual ω_λ.

Lemma A.9 (Adapted from Lemma 2 in Xiao and Zhang (2013)). For any L > 0, if L_μ is the Lipschitz constant of ∇L_μ, then

  ω_λ(T_{L,λ}(θ)) ≤ (L + S_L(θ))||T_{L,λ}(θ) − θ||_2 ≤ (L + L_μ)||T_{L,λ}(θ) − θ||_2,

where S_L(θ) = ||∇L_μ(T_{L,λ}(θ)) − ∇L_μ(θ)||_2/||T_{L,λ}(θ) − θ||_2 is a local Lipschitz constant, which satisfies S_L(θ) ≤ L_μ.

B Proof of Theorem 3.4

We first demonstrate the sublinear rate for the initial stages, in which the estimation error ||θ − θ*||_2 can be large due to the large regularization parameters λ_K. The proof is provided in Appendix F.

Theorem B.1. For any K = 1, ..., Ñ and λ_K > 0, let θ̄^[K] = argmin_θ F_{μ,λ_K}(θ) be the optimal solution of the K-th stage with regularization parameter λ_K, and let R be such that ||θ − θ̄^[K]||_2 ≤ R for any θ ∈ R^d satisfying F_{μ,λ_K}(θ) ≤ F_{μ,λ_K}(θ^(0)). If the initial value θ^(0) satisfies ω_{λ_{K−1}}(θ^(0)) ≤ λ_{K−1}/2, then within the K-th stage, for any t = 1, 2, ..., we have

  F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K]) ≤ 4||X||_2^2 R^2/((t + 1)μ).   (B.1)

To achieve the approximate KKT condition ω_{λ_K}(θ^(t)) ≤ ε_K, the number of proximal gradient steps is no more than

  (288 ||X||_2^6 R^2 L_μ/(ε_K^2 μ^3))^{1/2} − 3.   (B.2)

Note that L_μ ≤ ||X||_2^2/μ. Then (B.2) can be simplified as O(||X||_2^4 R/(ε_K μ^2)).

Next, we demonstrate the linear rate when the estimate satisfies θ ∈ B_r^{s*+s̃}. The proof is provided in Appendix G.

Theorem B.2. Suppose Assumptions 3.2 and 3.3 hold, and λ_K > 0 for any K = Ñ, ..., N. Let θ̄^[K] = argmin_θ F_{μ,λ_K}(θ) be the optimal solution of the K-th stage with regularization

parameter λ_K. If the initial value θ^(0) satisfies ω_{λ_K}(θ^(0)) ≤ λ_K/2 with ||θ^(0)_{S̄}||_0 ≤ s̃, then within the K-th stage, for any t = 1, 2, ..., we have ||θ^(t)_{S̄}||_0 ≤ s̃,

  F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K]) ≤ (1 − 1/(8κ_{s*+s̃}))^t · 4λ_K√(s*) ω_{λ_K}(θ^(0))/ρ⁻_{s*+s̃}, and
  ||θ^(t) − θ̄^[K]||_2^2 ≤ (1 − 1/(8κ_{s*+s̃}))^t · 4λ_K√(s*) ω_{λ_K}(θ^(0))/ρ⁻_{s*+s̃}.   (B.3)

For K = Ñ, ..., N−1, to achieve the approximate KKT condition ω_{λ_K}(θ^(t)) ≤ λ_K/4, the number of proximal gradient steps is no more than

  log(κ²_{s*+s̃} s*)/log(8κ_{s*+s̃}/(8κ_{s*+s̃} − 1)).   (B.4)

For K = N, to achieve the approximate KKT condition ω_{λ_N}(θ^(t)) ≤ ε_N, the number of proximal gradient steps is no more than

  log(96(1 + κ_{s*+s̃}) λ_N√(s*) κ_{s*+s̃}/ε_N)/log(8κ_{s*+s̃}/(8κ_{s*+s̃} − 1)).   (B.5)

From basic inequalities, since κ_{s*+s̃} ≥ 1, we have

  log(8κ_{s*+s̃}/(8κ_{s*+s̃} − 1)) = −log(1 − 1/(8κ_{s*+s̃})) ≥ 1/(8κ_{s*+s̃}).

Then (B.4) and (B.5) can be simplified as O(κ_{s*+s̃} log(κ³_{s*+s̃} s*)) and O(κ_{s*+s̃} log(κ³_{s*+s̃} λ_N √(s*)/ε_N)) respectively.

As can be seen from Theorem B.2, when the initial value θ^(0) satisfies the approximate KKT condition ω_{λ_K}(θ^(0)) ≤ λ_K/2 with θ^(0) ∈ B_r^{s*+s̃}, we can guarantee the geometric convergence of the objective value towards the minimal objective. Next, we show that if the output solution θ̂^[K−1] of the (K−1)-th path following stage satisfies the approximate KKT condition, and the regularization parameter λ_K of the K-th stage is chosen properly, then θ̂^[K−1] satisfies the approximate KKT condition for λ_K with a slightly larger bound. This characterizes the fact that good computational properties are preserved by using the warm start θ^(0) = θ̂^[K−1] and a geometric sequence of regularization parameters λ_K. We formalize this notion in Lemma B.3, and its proof is deferred to Appendix H.

Lemma B.3. Let θ̂^[K−1] be the approximate solution of the (K−1)-th path following stage, which satisfies the approximate KKT condition ω_{λ_{K−1}}(θ̂^[K−1]) ≤ λ_{K−1}/4. Then we have

  ω_{λ_K}(θ̂^[K−1]) ≤ λ_K/2,

where λ_K = ηλ_{K−1} with η ∈ (5/6, 1).

Combining Theorem B.2 and Lemma B.3, we achieve the global convergence in terms of the objective value using the path following proximal gradient method, and we obtain the bounds on the numbers of iterations T_K in Phase II directly from (B.4) and (B.5) of Theorem B.2. We then obtain the objective gap of the N-th stage by an analogous argument: in the N-th (final) path following stage, once the number of proximal gradient iterations is large enough that ω_{λ_N}(θ^(t)) ≤ ε_N holds, the result follows from Lemma A.5 with λ = λ̄ = λ_N. Finally, we need to show that there exists some Ñ ∈ {1, ..., N} such that θ̂^[Ñ] ∈ B_r^{s*+s̃}. We demonstrate this result in Lemma B.4 and provide its proof in Appendix I.

Lemma B.4. Suppose Assumptions 3.2 and 3.3 hold, and the approximate KKT condition ω_λ(θ) ≤ λ/4 is satisfied. If μ ≤ σ/4 and ρ⁻_{s*+s̃} r/(8√(s*)) ≥ λ > λ_N, then we have ||θ − θ*||_2 ≤ r and ||θ_{S̄}||_0 ≤ s̃.

Lemma B.4 guarantees that there exists some Ñ < N such that, for some λ_Ñ > λ_N, the approximate solution θ̂^[Ñ] satisfies the approximate KKT condition and θ̂^[Ñ] ∈ B_r^{s*+s̃}; we then enter Phase II of strong linear convergence by Theorem B.2. This finishes the proof. We further provide a bound on the total number of proximal gradient steps for each λ_K in the following lemma for interested readers. The proof is provided in Appendix J.

Lemma B.5. For each λ_K, K = 1, ..., N, if we restart the line search with a large enough L_max, then the total number of proximal gradient steps is no more than T_K + 1 + max{log₂ L_max − log₂ ρ⁺_{s*+s̃}, 0}.

C Proof of Lemma 3.6

Part 1. We first show that Assumption 3.2 holds. By y = Xθ* + ε and (C.4), we have

  ∇L_μ(θ*) = X^T(Xθ* − y)/(√n max{√n μ, ||y − Xθ*||_2}) = −X^T ε/(√n max{√n μ, ||ε||_2}).   (C.1)

Since ε has i.i.d. sub-Gaussian entries with E[ε_i] = 0 and E[ε_i^2] = σ^2 for all i = 1, ..., n, we have from Wainwright (2015) that

  P[||ε||_2 ≤ σ√n/4] ≤ exp(−c₁n).   (C.2)

By Negahban et al. (2012), we have the following result.

Lemma C.1. Assume X satisfies ||X_{*j}||_2 ≤ √n for all j = 1, ..., d, and ε has i.i.d. zero-mean sub-Gaussian entries with E[ε_i^2] = σ^2 for all i = 1, ..., n. Then we have

  P[(1/n)||X^T ε||_∞ ≥ σ√(log d/n)] ≤ 2/d.

Combining (C.1), (C.2) and Lemma C.1, we have, with probability at least 1 − 2/d − exp(−c₁n),

  ||∇L_μ(θ*)||_∞ ≤ 4√(log d/n)/max{4μ/σ, 3}.

Part 2. Next, we show that Assumption 3.3 holds. We divide the proof into two steps.

Step 1. Suppose X satisfies the RE condition, i.e.,

  ψ_min||v||_2^2 − φ_min(log d/n)||v||_1^2 ≤ (1/n)||Xv||_2^2 ≤ ψ_max||v||_2^2 + φ_max(log d/n)||v||_1^2.

Denote s̄ = s* + s̃. Since ||v||_0 ≤ s̄ implies ||v||_1 ≤ √(s̄)||v||_2, we have

  (ψ_min − φ_min s̄ log d/n)||v||_2^2 ≤ (1/n)||Xv||_2^2 ≤ (ψ_max + φ_max s̄ log d/n)||v||_2^2.

Then there exists a universal constant c₂ such that if n ≥ c₂ s̄ log d, we have

  (ψ_min/2)||v||_2^2 ≤ (1/n)||Xv||_2^2 ≤ 2ψ_max||v||_2^2.   (C.3)

Step 2. Conditioning on (C.3), we show that L_μ satisfies LRSC and LRSS with high probability. The gradient of L_μ(θ) is

  ∇L_μ(θ) = X^T(Xθ − y)/(√n max{√n μ, ||y − Xθ||_2}).   (C.4)

The Hessian of L_μ(θ) is

  ∇²L_μ(θ) = X^T X/(nμ) if (1/√n)||y − Xθ||_2 < μ, and ∇²L_μ(θ) = X^T(I − (y − Xθ)(y − Xθ)^T/||y − Xθ||_2^2)X/(√n||y − Xθ||_2) otherwise.   (C.5)

For notational convenience, define Δ = v − w for any v, w ∈ B_r with ||Δ||_0 ≤ s̄, and denote the residual of the first order Taylor expansion as δL_μ(w + Δ, w) = L_μ(w + Δ) − L_μ(w) − ∇L_μ(w)^T Δ. Using the first order Taylor expansion of L_μ(θ) at w and the Hessian of L_μ(θ) in (C.5), the mean value theorem implies that there exists some α ∈ [0, 1] such that

  δL_μ(w + Δ, w) = Δ^T X^T X Δ/(2nμ) if ||ξ||_2 < √n μ, and δL_μ(w + Δ, w) = Δ^T X^T(I − ξξ^T/||ξ||_2^2)XΔ/(2√n||ξ||_2) otherwise,

where ξ = y − X(w + αΔ). For notational simplicity, denote ż = X(v − θ*) and z̈ = X(w − θ*), which can be viewed as two fixed vectors in R^n. Without loss of generality, assume ||ż||_2 ≥ ||z̈||_2. Then we have

  max{||ż||_2, ||z̈||_2} ≤ √(2ψ_max n) · r ≤ σ√n/4.

Further, we have ξ = y − X(w + αΔ) = ε − X(w + αΔ − θ*) = ε − αż − (1 − α)z̈, and XΔ = ż − z̈. We have from Wainwright (2015) that

  P[| ||ε||_2/√n − σ | ≥ σδ] ≤ 2exp(−c₃nδ²).   (C.6)

Then, taking δ = 1/3 in (C.6), we have, with probability at least 1 − 2exp(−c₃n/9),

  ||ξ||_2 ≥ ||ε||_2 − α||ż||_2 − (1 − α)||z̈||_2 ≥ ||ε||_2 − σ√n/4 ≥ (2/3)σ√n − σ√n/4 ≥ σ√n/4.   (C.7)

We first discuss the RSS property. From (C.7) and μ ≤ σ/4, we have ||ξ||_2 ≥ √n μ, so from (C.5) we have

  δL_μ(w + Δ, w) = Δ^T X^T(I − ξξ^T/||ξ||_2^2)XΔ/(2√n||ξ||_2) ≤ ||XΔ||_2^2/(2√n||ξ||_2) ≤ 2ψ_max n||Δ||_2^2/(2√n · σ√n/4) ≤ (8ψ_max/σ)||Δ||_2^2/2.

Next, we verify the RSC property. From (C.7), ||ξ||_2 ≥ √n μ. We want to show that, with high probability, for a constant a ∈ (0, 1),

  (ξ^T XΔ)^2/||ξ||_2^2 ≤ a||XΔ||_2^2.   (C.8)

Consequently,

  Δ^T X^T(I − ξξ^T/||ξ||_2^2)XΔ = ||XΔ||_2^2 − (ξ^T XΔ)^2/||ξ||_2^2 ≥ (1 − a)||XΔ||_2^2,

which further implies

  δL_μ(w + Δ, w) = Δ^T X^T(I − ξξ^T/||ξ||_2^2)XΔ/(2√n||ξ||_2) ≥ (1 − a)ψ_min n||Δ||_2^2/(4√n||ξ||_2).   (C.9)

Since ||ż||_2 ≥ ||z̈||_2, for any constant a ∈ (0, 1),

  P[(ξ^T XΔ)^2 ≥ a||XΔ||_2^2||ξ||_2^2]
   = P[((ε − αż − (1−α)z̈)^T(ż − z̈))^2 ≥ a||ż − z̈||_2^2 ||ε − αż − (1−α)z̈||_2^2]
   ≤(i) P[(ε^T(ż − z̈) − ż^T(ż − z̈))^2 ≥ a||ż − z̈||_2^2 (||ε||_2 − ||ż||_2)^2]
   =(ii) P[|ε^T(ż − z̈)/||ż − z̈||_2 − ż^T(ż − z̈)/||ż − z̈||_2| ≥ √a (||ε||_2 − ||ż||_2)],   (C.10)

where (ii) is from dividing both sides by ||ż − z̈||_2^2, and (i) is from a geometric inspection and the randomness of ε: for any α ∈ [0, 1] and ||ż||_2 ≥ ||z̈||_2,

  ż^T(ż − z̈) ≥ (αż + (1 − α)z̈)^T(ż − z̈) and ż^T(ż − z̈) ≥ 0,

and the random vector ε with i.i.d. entries does not affect the inequality above. Let us first discuss one side of the probability in (C.10), i.e.,

  P[ε^T(ż − z̈)/||ż − z̈||_2 − ż^T(ż − z̈)/||ż − z̈||_2 ≥ √a (||ε||_2 − ||ż||_2)]
   ≤ P[ε^T(ż − z̈)/||ż − z̈||_2 − √a||ε||_2 ≥ (√a − 1)(||ż||_2 − ε^T ż/||ż||_2)].   (C.11)

Since ε has i.i.d. sub-Gaussian entries with E[ε_i] = 0 and E[ε_i^2] = σ^2 for all i = 1, ..., n, the random variables ε^T(ż − z̈)/||ż − z̈||_2 and ε^T ż are also zero-mean sub-Gaussian with variance proxies σ^2 and σ^2||ż||_2^2 respectively. We have from Wainwright (2015) that

  P[||ε||_2 ≤ (1 − δ)σ√n] ≤ exp(−c₄nδ²),   (C.12)
  P[|ε^T(ż − z̈)|/||ż − z̈||_2 ≥ σ√n δ] ≤ 2exp(−c₅nδ²),   (C.13)
  P[|ε^T ż| ≥ σ||ż||_2 √n δ] ≤ 2exp(−c₆nδ²).   (C.14)

Combining (C.12)–(C.14) with ||ż||_2 ≤ σ√n/4, we have from the union bound that, with probability at least 1 − exp(−c₄n) − 2exp(−c₅n) − 2exp(−c₆n),

  ||ε||_2 ≥ (2/3)σ√n, |ε^T(ż − z̈)|/||ż − z̈||_2 ≤ σ√n/64, and |ε^T ż| ≤ σ²n/16.

This implies that, for a ≥ 3/5, we have (ξ^T XΔ)^2 ≤ a||XΔ||_2^2||ξ||_2^2. For the other side of the probability in (C.10), we have

  P[−ε^T(ż − z̈)/||ż − z̈||_2 + ż^T(ż − z̈)/||ż − z̈||_2 ≥ √a (||ε||_2 − ||ż||_2)]
   ≤ P[−ε^T(ż − z̈)/||ż − z̈||_2 − √a||ε||_2 ≥ (√a − 1)(||ż||_2 − ε^T ż/||ż||_2)],   (C.15)

which can be bounded by the same argument.

24 Combiig C.0, C. ad C.5, we have C.8 holds with high probability, i.e., for ay r > 0, [ ξ P X ξ ] a X 6 exp. 44 Now wo boud ξ to obtai the desired result. From Waiwright 05, we have P [ ɛ σ + δ ] exp δ = exp, C where we take δ = /. From ξ = ɛ αż α z, we have i ii ξ ɛ + α ż + α z ɛ + z 3 σ + σ. C.7 where i is from ż z ad ii is from C.6 ad ż σ /4. The by the uio boud settig a = /, with probability at least 7 exp 44, we have δl µ w +, w ψ mi 8σ. Moreover, we also have r = σ 8ψ max > s 64σλ N /ψ mi s 8λ N /ρ s + s for large eough c s log d, where λ N λ N = 48 log d/. The choice of the costat i λ N λ N is somewhat arbitrary, which ca be ay fixed costat larger tha /η such that the existece of λ N is guarateed. D Proof of Theorem 3.7 Part. We first show that estimatio errors are as claimed. Sice θ is the approximate solutio of K-th path followig stage, it satisfies ω λk θ λ K /4 λ K+ / for t [N +, T ], the we have from Lemma B.3 that ω λk+ θ 0 [K+] λ K+/. By Theorem B., we have for ay t =,,..., Applyig Lemma A. recursively, we have θ [N] θ λ N s θ t [K+] S 0 s. ρ s + s ad θ [N] θ λ Ns. ρ s + s Applyig Lemma 3.6 with λ N = 4 log d/ ad ρ s + s = ψ mi 8σ, the by uio boud, with probability at least 8 exp 44 d, we have θ [N] θ 384σ s log d/ ψ mi, θ [N] θ 304σs log d/ ψ mi. 4

Part 2. Next, we demonstrate the result for the estimator of the variance. Let θ̄^[N] = argmin_θ F_{μ,λ_N}(θ) be the optimal solution of the N-th stage. Applying the argument of Part 1 recursively, we have

  ||θ̄^[N] − θ*||_1 ≤ 304σ s*√(log d/n)/ψ_min.   (D.1)

Denote c₁, c₂, ... as positive universal constants. Then we have

  L_μ(θ̄^[N]) − L_μ(θ*) ≤(i) λ_N(||θ*||_1 − ||θ̄^[N]||_1) ≤ λ_N(||θ*_S − θ̄^[N]_S||_1 − ||θ̄^[N]_{S̄}||_1) ≤ λ_N||θ̄^[N] − θ*||_1 ≤(ii) c₁σ s* log d/n,   (D.2)

where (i) is from the optimality of θ̄^[N] and (ii) is from the value of λ_N and the ℓ1 error bound in (D.1). On the other hand, from the convexity of L_μ(θ), we have

  L_μ(θ̄^[N]) − L_μ(θ*) ≥ ∇L_μ(θ*)^T(θ̄^[N] − θ*) ≥ −||∇L_μ(θ*)||_∞||θ̄^[N] − θ*||_1 ≥(i) −λ_N||θ̄^[N] − θ*||_1/6 ≥(ii) −c₃σ s* log d/n,   (D.3)

where (i) is from Assumption 3.2 and (ii) is from the value of λ_N and the ℓ1 error bound in (D.1). For our choice of μ and n, we have L_μ(θ̄^[N]) = (1/√n)||y − Xθ̄^[N]||_2 − μ/2 by Proposition 3.8; hence

  L_μ(θ̄^[N]) − L_μ(θ*) = (1/√n)(||y − Xθ̄^[N]||_2 − ||ε||_2).   (D.4)

From Wainwright (2015), we have, for any δ > 0,

  P[| ||ε||_2/√n − σ | ≥ σδ] ≤ 2exp(−nδ²/8).   (D.5)

Combining (D.2), (D.3), (D.4) and (D.5) with δ = c₃ s* log d/n, we have, with high probability,

  |(1/√n)||y − Xθ̄^[N]||_2 − σ| = O(σ s* log d/n).   (D.6)

From Part 1, for n ≥ c₄ s* log d, we have with high probability

  ||θ̂^[N] − θ*||_2 ≤ 384σ√(s* log d/n)/ψ_min ≤ σ/(8√ψ_max),

so that θ̂^[N] ∈ B_r^{s*+s̃} and ||θ̂^[N] − θ̄^[N]||_0 ≤ s* + s̃. Then, from the analysis of Theorem B.2, we have

  ω_{λ_N}(θ̂^[N]) ≤ (1 + κ_{s*+s̃})√(4ρ⁺_{s*+s̃}(F_{μ,λ_N}(θ̂^[N]) − F_{μ,λ_N}(θ̄^[N]))) ≤ ε_N·(1 + κ_{s*+s̃}),

which implies

  F_{μ,λ_N}(θ̂^[N]) − F_{μ,λ_N}(θ̄^[N]) ≤ ε_N²(1 + κ_{s*+s̃})/(4ρ⁺_{s*+s̃}).   (D.7)

On the other hand, from the LRSC property of L_μ, the convexity of the ℓ1 norm, and the optimality of θ̄^[N], we have

  F_{μ,λ_N}(θ̂^[N]) − F_{μ,λ_N}(θ̄^[N]) ≥ (ρ⁻_{s*+s̃}/2)||θ̂^[N] − θ̄^[N]||_2^2.   (D.8)

Combining (D.7), (D.8) and Assumption 3.5, we have

  (1/√n)||X(θ̂^[N] − θ̄^[N])||_2 ≤ 4ε_N(1 + κ_{s*+s̃})/ψ_min.   (D.9)

Combining (D.6) and (D.9), we have

  (1/√n)||y − Xθ̂^[N]||_2 ≤ (1/√n)||y − Xθ̄^[N]||_2 + (1/√n)||X(θ̂^[N] − θ̄^[N])||_2 ≤ (1/√n)||y − Xθ̄^[N]||_2 + 4ε_N(1 + κ_{s*+s̃})/ψ_min.

If ε_N ≤ c₅σ s* log d/n for some constant c₅, then we obtain the desired result.

E Proof of Proposition 3.8

Let θ̄ and θ̃ be the unique global optima of (1.3) and (1.4) respectively, for λ = λ_K with K = Ñ+1, ..., N. We show that θ̄ = θ̃ under the proposed conditions. Applying the argument of the proof of Theorem 3.7 recursively, we have, with probability at least 1 − 8exp(−c₆n) − 2/d,

  ||θ̃ − θ*||_2 ≤ 384σ√(s* log d/n)/ψ_min.

By the RE condition on X in Assumption 3.5, this implies

  (1/√n)||X(θ̃ − θ*)||_2 ≤ 384σ√(ψ_max)√(s* log d/n)/ψ_min.   (E.1)

On the other hand, we have

  (1/√n)||y − Xθ̃||_2 = (1/√n)||X(θ* − θ̃) + ε||_2 ≥ (1/√n)||ε||_2 − (1/√n)||X(θ̃ − θ*)||_2.   (E.2)

Since ε has i.i.d. sub-Gaussian entries with E[ε_i] = 0 and E[ε_i^2] = σ^2 for all i = 1, ..., n, we have from Wainwright (2015) that

  P[||ε||_2 ≤ 3σ√n/4] ≤ exp(−c₇n).   (E.3)

Combining (E.1), (E.2), (E.3) and the condition n ≥ c₄ s* log d for some constant c₄, we have, with probability at least 1 − 9exp(−c₈n) − 2/d,

  (1/√n)||y − Xθ̃||_2 ≥ 3σ/4 − 384σ√(ψ_max)√(s* log d/n)/ψ_min ≥ σ/4 > μ.

This implies F_{μ,λ}(θ̃) = F_λ(θ̃) − μ/2, where F_λ denotes the unsmoothed objective in (1.3); thus argmin_θ F_{μ,λ}(θ) = argmin_θ F_λ(θ), i.e., θ̃ = θ̄. Besides, this also implies (1/√n)||y − Xθ̃||_2 > σ/4 ≥ μ, i.e., θ̃ is not in the smoothed region. Applying the same argument again to θ̄, we have that, for large enough n, with high probability,

  (1/√n)||y − Xθ̄||_2 > σ/4 ≥ μ,

and r = σ/(8√ψ_max) > 64σλ_Ñ√(s*)/ψ_min ≥ 8√(s*)λ_Ñ/ρ⁻_{s*+s̃} is guaranteed, where λ_Ñ = 48√(log d/n) ≥ λ_N. By Lemma B.4, this implies the existence of the linear convergence region, which does not fall into the smoothed region. Besides, θ* is not in the smoothed region either.

F Proof of Theorem B.1

The sublinear rate of convergence (B.1) follows directly from Lemma A.7. In terms of the optimal residual, we have

  ω²_{λ_K}(θ^(t+1)) ≤(i) (L_t + S_{L_t}(θ^(t)))²||θ^(t+1) − θ^(t)||_2^2 ≤(ii) (L̃_t + ||X||_2²/μ)²||θ^(t+1) − θ^(t)||_2^2,

and, averaging over iterations i = m, ..., k,

  min_{i ∈ [m,k]} ||θ^(i+1) − θ^(i)||_2^2 ≤ (1/(k − m + 1)) Σ_{i=m}^{k} ||θ^(i+1) − θ^(i)||_2^2
   ≤(iv) (2/((k − m + 1) min_{i ∈ [m,k]} L_i)) (F_{μ,λ_K}(θ^(m)) − F_{μ,λ_K}(θ̄^[K])),

so that, using (A.16) and Remark A.1 for (ii)–(iv), Lemma A.7 for (v), and choosing m = k/2 for (vi),

  ω²_{λ_K}(θ^(k+1)) ≤(v) 72R²||X||_2^6 L_μ/((k − m + 1)(m + 1)μ³) ≤(vi) 288R²||X||_2^6 L_μ/((k + 3)²μ³).   (F.1)

To achieve the approximate KKT condition ω_{λ_K}(θ^(t)) ≤ ε_K, we require the right-hand side of (F.1) to be no greater than ε_K², which gives the desired result (B.2).

G Proof of Theorem B.2

Note that the RSS property implies that the line search terminates with a step size parameter L_t satisfying

  ρ⁺_{s*+s̃}/2 ≤ L_t ≤ 2ρ⁺_{s*+s̃}.   (G.1)

Since the initialization θ^(0) satisfies ω_{λ_K}(θ^(0)) ≤ λ_K/2 with ||θ^(0)_{S̄}||_0 ≤ s̃, by Lemma A.2 the objective satisfies

  F_{μ,λ_K}(θ^(0)) − F_{μ,λ_K}(θ*) ≤ 6λ_K²s*/ρ⁻_{s*+s̃}.

Then by Lemma A.4, we have ||θ^(1)_{S̄}||_0 ≤ s̃. By the monotone decrease of F_{μ,λ_K}(θ^(t)) from (A.16) in Lemma A.8 and by recursively applying Lemma A.4, ||θ^(t)_{S̄}||_0 ≤ s̃ holds in (B.3) for any t = 1, 2, .... For the objective error, we have

  F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K]) ≤(i) (1 − 1/(8κ_{s*+s̃}))^t (F_{μ,λ_K}(θ^(0)) − F_{μ,λ_K}(θ̄^[K])) ≤(ii) (1 − 1/(8κ_{s*+s̃}))^t · 4λ_K√(s*) ω_{λ_K}(θ^(0))/ρ⁻_{s*+s̃},   (G.2)

where (i) is from Lemma A.6, and (ii) is from Lemma A.5 with λ = λ̄ = λ_K and ω_{λ_K}(θ^(0)) ≤ λ_K/2 ≤ λ_K, which yields (B.3). Combining (G.2) and (A.4), we have

  ||θ^(t) − θ̄^[K]||_2^2 ≤ (2/ρ⁻_{s*+s̃})(F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K])) ≤ (1 − 1/(8κ_{s*+s̃}))^t · 8λ_K√(s*) ω_{λ_K}(θ^(0))/(ρ⁻_{s*+s̃})².

For the optimal residual ω_{λ_K}(θ^(t+1)) at the (t+1)-th iteration of the K-th path following stage, we have

  ω_{λ_K}(θ^(t+1)) ≤(i) (L_t + S_{L_t}(θ^(t)))||θ^(t+1) − θ^(t)||_2 ≤(ii) (L_t + ρ⁺_{s*+s̃})||θ^(t+1) − θ^(t)||_2
   ≤(iii),(iv) (L_t + ρ⁺_{s*+s̃})√(2(F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ^(t+1)))/L_t)
   ≤(v) (1 + κ_{s*+s̃})√(4ρ⁺_{s*+s̃}(F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K])))
   ≤(vi) (1 + κ_{s*+s̃}) (1 − 1/(8κ_{s*+s̃}))^{t/2} √(96λ_K²√(s*) κ_{s*+s̃}),   (G.3)

where (i) is from Lemma A.9, (ii) is from S_{L_t}(θ^(t)) ≤ ρ⁺_{s*+s̃}, (iii) is from ρ⁺_{s*+s̃}/2 ≤ L_t in (G.1), (iv) is from (A.16) in Lemma A.8, (v) is from L_t ≤ 2ρ⁺_{s*+s̃} in (G.1) and the monotone decrease of F_{μ,λ_K}(θ^(t)) from (A.16) in Lemma A.8, and (vi) is from (G.2) and κ_{s*+s̃} = ρ⁺_{s*+s̃}/ρ⁻_{s*+s̃}.

For the K-th path following stage, K = Ñ, ..., N−1, to have ω_{λ_K}(θ^(t+1)) ≤ λ_K/4, we set the right-hand side of (G.3) to be no greater than λ_K/4, which requires the number of iterations to be at least the upper bound in (B.4). For the last (N-th) path following stage, we need ω_{λ_N}(θ̂^[N]) ≤ ε_N ≤ λ_N/4; setting the right-hand side of (G.3) to be no greater than ε_N requires the number of iterations to be at least the upper bound in (B.5).

H Proof of Lemma B.3

Since ω_{λ_{K−1}}(θ̂^[K−1]) ≤ λ_{K−1}/4, there exists some subgradient g ∈ ∂||θ̂^[K−1]||_1 such that

  ||∇L_μ(θ̂^[K−1]) + λ_{K−1}g||_∞ ≤ λ_{K−1}/4.   (H.1)

By the definition of ω_{λ_K}, we have

  ω_{λ_K}(θ̂^[K−1]) ≤ ||∇L_μ(θ̂^[K−1]) + λ_K g||_∞ = ||∇L_μ(θ̂^[K−1]) + λ_{K−1}g + (λ_K − λ_{K−1})g||_∞
   ≤ ||∇L_μ(θ̂^[K−1]) + λ_{K−1}g||_∞ + (λ_{K−1} − λ_K)||g||_∞ ≤(i) λ_{K−1}/4 + (1 − η)λ_{K−1} ≤(ii) λ_K/2,

where (i) is from (H.1) and the choice λ_K = ηλ_{K−1}, and (ii) is from the condition on η.


More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

On forward improvement iteration for stopping problems

On forward improvement iteration for stopping problems O forward improvemet iteratio for stoppig problems Mathematical Istitute, Uiversity of Kiel, Ludewig-Mey-Str. 4, D-24098 Kiel, Germay irle@math.ui-iel.de Albrecht Irle Abstract. We cosider the optimal

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

arxiv: v1 [math.pr] 13 Oct 2011

arxiv: v1 [math.pr] 13 Oct 2011 A tail iequality for quadratic forms of subgaussia radom vectors Daiel Hsu, Sham M. Kakade,, ad Tog Zhag 3 arxiv:0.84v math.pr] 3 Oct 0 Microsoft Research New Eglad Departmet of Statistics, Wharto School,

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Graphical Nonconvex Optimization via an Adaptive Convex Relaxation

Graphical Nonconvex Optimization via an Adaptive Convex Relaxation Graphical Nocovex Optimizatio via a Adaptive Covex Relaxatio Qiag Su 1 Kea Mig Ta Ha Liu 3 Tog Zhag 3 Abstract We cosider the problem of learig highdimesioal Gaussia graphical models. The graphical lasso

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

APPENDIX A SMO ALGORITHM

APPENDIX A SMO ALGORITHM AENDIX A SMO ALGORITHM Sequetial Miimal Optimizatio SMO) is a simple algorithm that ca quickly solve the SVM Q problem without ay extra matrix storage ad without usig time-cosumig umerical Q optimizatio

More information

A survey on penalized empirical risk minimization Sara A. van de Geer

A survey on penalized empirical risk minimization Sara A. van de Geer A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

Regularity conditions

Regularity conditions Statistica Siica: Supplemet MEASUREMENT ERROR IN LASSO: IMPACT AND LIKELIHOOD BIAS CORRECTION Øystei Sørese, Aroldo Frigessi ad Mage Thorese Uiversity of Oslo Supplemetary Material This ote cotais proofs

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

On cross-validated Lasso

On cross-validated Lasso O cross-validated Lasso Deis Chetverikov Zhipeg Liao The Istitute for Fiscal Studies Departmet of Ecoomics, UCL cemmap workig paper CWP47/6 O Cross-Validated Lasso Deis Chetverikov Zhipeg Liao Abstract

More information

Estimation of Backward Perturbation Bounds For Linear Least Squares Problem

Estimation of Backward Perturbation Bounds For Linear Least Squares Problem dvaced Sciece ad Techology Letters Vol.53 (ITS 4), pp.47-476 http://dx.doi.org/.457/astl.4.53.96 Estimatio of Bacward Perturbatio Bouds For Liear Least Squares Problem Xixiu Li School of Natural Scieces,

More information

On Cross-Validated Lasso

On Cross-Validated Lasso O Cross-Validated Lasso Deis Chetverikov Zhipeg Liao Abstract I this paper, we derive a rate of covergece of the Lasso estimator whe the pealty parameter λ for the estimator is chose usig K-fold cross-validatio;

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data Doubly Stochastic Primal-Dual Coordiate Method for Regularized Empirical Risk Miimizatio with Factorized Data Adams Wei Yu, Qihag Li, Tiabao Yag Caregie Mello Uiversity The Uiversity of Iowa weiyu@cs.cmu.edu,

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Vector Quantization: a Limiting Case of EM

Vector Quantization: a Limiting Case of EM . Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z

More information

Lecture 11: Decision Trees

Lecture 11: Decision Trees ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces

More information

Variable selection in principal components analysis of qualitative data using the accelerated ALS algorithm

Variable selection in principal components analysis of qualitative data using the accelerated ALS algorithm Variable selectio i pricipal compoets aalysis of qualitative data usig the accelerated ALS algorithm Masahiro Kuroda Yuichi Mori Masaya Iizuka Michio Sakakihara (Okayama Uiversity of Sciece) (Okayama Uiversity

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

Differentiable Convex Functions

Differentiable Convex Functions Differetiable Covex Fuctios The followig picture motivates Theorem 11. f ( x) f ( x) f '( x)( x x) ˆx x 1 Theorem 11 : Let f : R R be differetiable. The, f is covex o the covex set C R if, ad oly if for

More information

MODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING. University of Illinois at Urbana-Champaign

MODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING. University of Illinois at Urbana-Champaign MODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING Yuheg Bu Jiaxu Lu Veugopal V. Veeravalli Uiversity of Illiois at Urbaa-Champaig Tsighua Uiversity Email: bu3@illiois.edu, lujx4@mails.tsighua.edu.c,

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Research Article The Oracle Inequalities on Simultaneous Lasso and Dantzig Selector in High-Dimensional Nonparametric Regression

Research Article The Oracle Inequalities on Simultaneous Lasso and Dantzig Selector in High-Dimensional Nonparametric Regression athematical Problems i Egieerig Volume 3, Article ID 5736, 6 pages http://dx.doi.org/.55/3/5736 Research Article The Oracle Iequalities o Simultaeous Lasso ad Datzig Selector i High-Dimesioal Noparametric

More information

Expectation-Maximization Algorithm.

Expectation-Maximization Algorithm. Expectatio-Maximizatio Algorithm. Petr Pošík Czech Techical Uiversity i Prague Faculty of Electrical Egieerig Dept. of Cyberetics MLE 2 Likelihood.........................................................................................................

More information

Machine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring

Machine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring Machie Learig Regressio I Hamid R. Rabiee [Slides are based o Bishop Book] Sprig 015 http://ce.sharif.edu/courses/93-94//ce717-1 Liear Regressio Liear regressio: ivolves a respose variable ad a sigle predictor

More information

Estimation Error of the Constrained Lasso

Estimation Error of the Constrained Lasso Estimatio Error of the Costraied Lasso Nissim Zerbib 1, Ye-Hua Li, Ya-Pig Hsieh, ad Volka Cevher Abstract This paper presets a o-asymptotic upper boud for the estimatio error of the costraied lasso, uder

More information

Mathematical Modeling of Optimum 3 Step Stress Accelerated Life Testing for Generalized Pareto Distribution

Mathematical Modeling of Optimum 3 Step Stress Accelerated Life Testing for Generalized Pareto Distribution America Joural of Theoretical ad Applied Statistics 05; 4(: 6-69 Published olie May 8, 05 (http://www.sciecepublishiggroup.com/j/ajtas doi: 0.648/j.ajtas.05040. ISSN: 6-8999 (Prit; ISSN: 6-9006 (Olie Mathematical

More information

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic

More information

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates Iteratioal Joural of Scieces: Basic ad Applied Research (IJSBAR) ISSN 2307-4531 (Prit & Olie) http://gssrr.org/idex.php?joural=jouralofbasicadapplied ---------------------------------------------------------------------------------------------------------------------------

More information

Optimization Methods MIT 2.098/6.255/ Final exam

Optimization Methods MIT 2.098/6.255/ Final exam Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

Homework #2. CSE 546: Machine Learning Prof. Kevin Jamieson Due: 11/2 11:59 PM

Homework #2. CSE 546: Machine Learning Prof. Kevin Jamieson Due: 11/2 11:59 PM Homework #2 CSE 546: Machie Learig Prof. Kevi Jamieso Due: 11/2 11:59 PM 1 A Taste of Learig Theory 1. [5 poits] For i = 1,..., fix x i R d ad let y i = x T i w i.i.d. + ɛ i where ɛ i N (0, σ 2 ). All

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

Homework Set #3 - Solutions

Homework Set #3 - Solutions EE 15 - Applicatios of Covex Optimizatio i Sigal Processig ad Commuicatios Dr. Adre Tkaceko JPL Third Term 11-1 Homework Set #3 - Solutios 1. a) Note that x is closer to x tha to x l i the Euclidea orm

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Precise Rates in Complete Moment Convergence for Negatively Associated Sequences

Precise Rates in Complete Moment Convergence for Negatively Associated Sequences Commuicatios of the Korea Statistical Society 29, Vol. 16, No. 5, 841 849 Precise Rates i Complete Momet Covergece for Negatively Associated Sequeces Dae-Hee Ryu 1,a a Departmet of Computer Sciece, ChugWoo

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION DOI: 10.1038/NPHYS309 O the reality of the quatum state Matthew F. Pusey, 1, Joatha Barrett, ad Terry Rudolph 1 1 Departmet of Physics, Imperial College Lodo, Price Cosort Road, Lodo SW7 AZ, Uited Kigdom

More information

The L[subscript 1] penalized LAD estimator for high dimensional linear regression

The L[subscript 1] penalized LAD estimator for high dimensional linear regression The L[subscript 1] pealized LAD estimator for high dimesioal liear regressio The MIT Faculty has made this article opely available. Please share how this access beefits you. Your story matters. Citatio

More information

Comparison of Minimum Initial Capital with Investment and Non-investment Discrete Time Surplus Processes

Comparison of Minimum Initial Capital with Investment and Non-investment Discrete Time Surplus Processes The 22 d Aual Meetig i Mathematics (AMM 207) Departmet of Mathematics, Faculty of Sciece Chiag Mai Uiversity, Chiag Mai, Thailad Compariso of Miimum Iitial Capital with Ivestmet ad -ivestmet Discrete Time

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Feedback in Iterative Algorithms

Feedback in Iterative Algorithms Feedback i Iterative Algorithms Charles Byre (Charles Byre@uml.edu), Departmet of Mathematical Scieces, Uiversity of Massachusetts Lowell, Lowell, MA 01854 October 17, 2005 Abstract Whe the oegative system

More information

Technical Proofs for Homogeneity Pursuit

Technical Proofs for Homogeneity Pursuit Techical Proofs for Homogeeity Pursuit bstract This is the supplemetal material for the article Homogeeity Pursuit, submitted for publicatio i Joural of the merica Statistical ssociatio. B Proofs B. Proof

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

ECE-S352 Introduction to Digital Signal Processing Lecture 3A Direct Solution of Difference Equations

ECE-S352 Introduction to Digital Signal Processing Lecture 3A Direct Solution of Difference Equations ECE-S352 Itroductio to Digital Sigal Processig Lecture 3A Direct Solutio of Differece Equatios Discrete Time Systems Described by Differece Equatios Uit impulse (sample) respose h() of a DT system allows

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

CMSE 820: Math. Foundations of Data Sci.

CMSE 820: Math. Foundations of Data Sci. Lecture 17 8.4 Weighted path graphs Take from [10, Lecture 3] As alluded to at the ed of the previous sectio, we ow aalyze weighted path graphs. To that ed, we prove the followig: Theorem 6 (Fiedler).

More information

Bayesian Methods: Introduction to Multi-parameter Models

Bayesian Methods: Introduction to Multi-parameter Models Bayesia Methods: Itroductio to Multi-parameter Models Parameter: θ = ( θ, θ) Give Likelihood p(y θ) ad prior p(θ ), the posterior p proportioal to p(y θ) x p(θ ) Margial posterior ( θ, θ y) is Iterested

More information

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5 CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio

More information