A First Order Free Lunch for SQRT-Lasso


A First Order Free Lunch for SQRT-Lasso

Xingguo Li, Jarvis Haupt, Raman Arora, Han Liu, Mingyi Hong, and Tuo Zhao

arXiv [cs.LG], May 2016

Abstract

Many statistical machine learning techniques sacrifice convenient computational structures to gain estimation robustness and modeling flexibility. In this paper, we study this fundamental tradeoff through a SQRT-Lasso problem for sparse linear regression and sparse precision matrix estimation in high dimensions. We explain how novel optimization techniques help address these computational challenges. Particularly, we propose a pathwise iterative smoothing shrinkage thresholding algorithm for solving the SQRT-Lasso optimization problem. We further provide a novel model-based perspective for analyzing the smoothing optimization framework, which allows us to establish a nearly linear convergence (R-linear convergence) guarantee for our proposed algorithm. This implies that solving the SQRT-Lasso optimization is almost as easy as solving the Lasso optimization. Moreover, we show that our proposed algorithm can also be applied to sparse precision matrix estimation, and enjoys good computational properties. Numerical experiments are provided to support our theory.

1 Introduction

Given a design matrix X ∈ R^{n×d} and a response vector y ∈ R^n, we consider a linear model y = Xθ* + ε, where θ* ∈ R^d is an unknown coefficient vector, and ε ∈ R^n is a random noise vector with i.i.d. sub-Gaussian entries, E[ε_i] = 0 and E[ε_i^2] = σ^2 for all i = 1, ..., n. We are interested in estimating θ* in high dimensions, where n/d → 0. A popular assumption in high dimensions is that only a small subset of variables is relevant in modeling, i.e., many entries of θ* are zero. To obtain such a sparse estimator, Tibshirani (1996) proposed Lasso, which solves

  θ̂ = argmin_θ (1/n)||y − Xθ||_2^2 + λ_Lasso ||θ||_1,   (1.1)

where λ_Lasso is the regularization parameter, which encourages solution sparsity. The statistical properties of Lasso have been established in Zhang and Huang (2008); Zhang (2009); Bickel et al.

(Affiliations: Xingguo Li and Jarvis Haupt are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA; Raman Arora and Tuo Zhao are with the Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Han Liu is with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; Mingyi Hong is with the Department of Industrial and Manufacturing Systems Engineering, Iowa State University. Emails: lixx66@umn.edu, jdhaupt@umn.edu, arora@cs.jhu.edu, hanliu@princeton.edu, mingyi@iastate.edu, tour@cs.jhu.edu.)

(2009); Negahban et al. (2012). In particular, given λ_Lasso ≍ σ√(log d / n), the Lasso estimator in (1.1) attains the minimax optimal rate of convergence in parameter estimation,

  ||θ̂ − θ*||_2 = O_P(σ√(s* log d / n)),   (1.2)

where s* denotes the number of nonzero entries in θ* (Ye and Zhang, 2010; Raskutti et al., 2011). Despite these favorable properties, the Lasso approach has a significant drawback: the selected regularization parameter λ_Lasso scales linearly with the unknown quantity σ. Therefore, we need to carefully tune λ_Lasso over a wide range of potential values in order to get a good finite-sample performance. To overcome this drawback, Belloni et al. (2011) proposed SQRT-Lasso, which solves

  θ̂ = argmin_{θ ∈ R^d} (1/√n)||y − Xθ||_2 + λ_SQRT ||θ||_1.   (1.3)

They further show that SQRT-Lasso requires no prior knowledge of σ. Choosing λ_SQRT ≍ √(log d / n), the SQRT-Lasso estimator in (1.3) attains the same optimal statistical rate of convergence in parameter estimation as (1.2). This means that the regularization selection for SQRT-Lasso does not scale with σ, and we can specify a much smaller range of potential values for tuning λ_SQRT than for Lasso. Besides estimating θ*, SQRT-Lasso can also estimate σ, which further makes it applicable to sparse precision matrix estimation; this is not the case with Lasso. Specifically, given n observations i.i.d. sampled from a d-variate normal distribution with mean 0 and a sparse precision matrix Θ*, Liu and Wang (2012) proposed an estimator based on SQRT-Lasso (see more details in Section 4), and showed that it attains the minimax optimal statistical rate of convergence in parameter estimation

  ||Θ̂ − Θ*||_2 = O_P(||Θ*||_2 s*√(log d / n)),

where ||Θ*||_2 denotes the spectral norm of Θ* (i.e., the largest singular value of Θ*), and s* denotes the maximum number of nonzero entries in each column of Θ* (i.e., max_l ||Θ*_{*l}||_0 ≤ s*).

Though SQRT-Lasso simplifies the tuning effort and achieves the optimal statistical properties for both sparse linear regression and sparse precision matrix estimation in high dimensions, the optimization problem in (1.3) is computationally more challenging than (1.1) for Lasso, because the ℓ2 loss in SQRT-Lasso does not have the same nice computational structure as the least squares loss in Lasso. For example, the ℓ2 loss can be nondifferentiable, and it does not have a Lipschitz continuous gradient. Belloni et al. (2011) converted (1.3) to a second order cone optimization problem, and further solved it by an interior point method; Li et al. (2015) then solved (1.3) by an ADMM algorithm. Neither of them, however, can scale to large problems. In contrast, Xiao and Zhang (2013) proposed an efficient pathwise iterative shrinkage thresholding algorithm (PISTA) for solving (1.1), which attains a linear convergence to the unique sparse global optimum with high probability.

To address this computational challenge, we propose a pathwise iterative smoothing shrinkage thresholding algorithm (PIS²TA) to solve (1.3). Specifically, we first apply the conjugate dual

(Footnote: The notation O_P is defined in the Notations paragraph at the end of this section.)

smoothing approach to the nonsmooth ℓ2 loss (Nesterov, 2005; Beck and Teboulle, 2012), and obtain a smooth surrogate denoted by ||y − Xθ||_μ, where μ > 0 is a smoothing parameter (see more details in Section 2). We then apply PISTA to solve the partially smoothed optimization problem as follows:

  θ̂ = argmin_{θ ∈ R^d} (1/√n)||y − Xθ||_μ + λ_SQRT ||θ||_1.   (1.4)

Existing computational theory guarantees that our proposed PIS²TA algorithm attains a sublinear convergence to the global optimum in terms of the objective value (Nesterov, 2005). However, our numerical experiments show that PIS²TA achieves far better empirical computational performance (much better than sublinear convergence) for solving SQRT-Lasso: it is significantly more efficient than other competing algorithms, and nearly as efficient as Xiao and Zhang (2013) for solving Lasso. This is because the existing computational analyses of the conjugate dual smoothing approach do not take certain specific modeling structures into consideration. For example: (I) The ℓ2 loss is only nonsmooth when all residuals are equal to zero (a significantly overfitted model). But this is very unlikely to happen, because we are solving (1.3) with a sufficiently large regularization parameter; (II) Although the smoothed ℓ2 loss is not strongly convex, if we restrict the solution to a sparse domain, the smoothed ℓ2 loss can behave like a strongly convex function over a neighborhood of θ*. Motivated by these observations, we establish a new computational theory for PIS²TA, which exploits the above modeling structures. Particularly, we show that PIS²TA achieves a nearly linear convergence (R-linear convergence) to the unique sparse global optimum for solving (1.3) with high probability, and also gives us a well fitted model. There are two implications: (I) We can solve the SQRT-Lasso optimization nearly as efficiently as the Lasso optimization; (II) We pay almost no price in optimization accuracy when using the smoothing approach for solving the SQRT-Lasso optimization, because (1.4) and (1.3) share the same unique sparse global optimum with high probability.

As an extension of our theory for the SQRT-Lasso optimization, we further analyze the computational properties of our proposed PIS²TA algorithm for sparse precision matrix estimation in high dimensions. We show that PIS²TA also achieves an R-linear convergence to the unique sparse global optimum with high probability. We provide numerical experiments on simulated and real data to support our theory. All proofs of our analysis are deferred to the supplementary material.

Notations: Given a vector v = (v_1, ..., v_d)^T ∈ R^d, we define the vector norms ||v||_1 = Σ_j |v_j|, ||v||_2^2 = Σ_j v_j^2, and ||v||_∞ = max_j |v_j|. We denote the number of nonzero entries in v as ||v||_0 = Σ_j 1(v_j ≠ 0). We denote v_{\j} = (v_1, ..., v_{j−1}, v_{j+1}, ..., v_d)^T ∈ R^{d−1} as the subvector of v with the j-th entry removed. Let A ⊆ {1, ..., d} be an index set. We use Ā to denote the complement of A, i.e., Ā = {j : j ∈ {1, ..., d}, j ∉ A}. We use v_A to denote the subvector of v obtained by extracting all entries of v with indices in A. Given a matrix A ∈ R^{d×d}, we use A_{*j} = (A_{1j}, ..., A_{dj})^T to denote the j-th column of A, and A_{k*} = (A_{k1}, ..., A_{kd})^T to denote the k-th row of A. Let Λ_max(A) and Λ_min(A) be the largest and smallest eigenvalues of A. We define ||A||_F^2 = Σ_j ||A_{*j}||_2^2 and ||A||_2^2 = Λ_max(A^T A). We denote A_{\i\j} as the submatrix of A with the i-th row and the j-th column removed, and A_{i\j} as the i-th row of A with its j-th entry removed. Let A ⊆ {1, ..., d}

be an index set. We use A_{AA} to denote the submatrix of A obtained by extracting all entries of A with both row and column indices in A. We write A ≻ 0 if A is a positive-definite matrix. Given two real sequences {A_n}, {a_n}, we write A_n = O(a_n) (or A_n = Ω(a_n)) if and only if there exist M ∈ R_+ and N ∈ N such that |A_n| ≤ M|a_n| (or |A_n| ≥ M|a_n|) for all n ≥ N; A_n ≍ a_n if A_n = O(a_n) and A_n = Ω(a_n) simultaneously; A_n = O_P(a_n) if for every δ ∈ (0, 1) there exist M ∈ R_+ and N_δ ∈ N such that P[|A_n| > M|a_n|] < δ for all n ≥ N_δ; and A_n = o(a_n) if for every δ > 0 there exists N_δ ∈ N such that |A_n| ≤ δ|a_n| for all n ≥ N_δ, i.e., lim_{n→∞} A_n/a_n = 0. Given a vector x ∈ R^d and a real value λ > 0, we denote the soft thresholding operator S_λ(x) = [sign(x_j) max{|x_j| − λ, 0}]_{j=1}^d.

2 Algorithm

Our proposed algorithm consists of three components: (I) Conjugate Dual Smoothing, (II) the Iterative Shrinkage Thresholding Algorithm (ISTA), and (III) Pathwise Optimization.

(I) The Conjugate Dual Smoothing approach is adopted to obtain a smooth surrogate of the ℓ2 loss (Nesterov, 2005; Beck and Teboulle, 2012). We define the smoothed ℓ2 loss function as

  (1/√n)||y − Xθ||_μ = max_{||z||_2 ≤ 1} (1/√n) z^T(y − Xθ) − (μ/2)||z||_2^2.   (2.1)

The optimization problem in (2.1) admits a closed form solution:

  L_μ(θ) = (1/√n)||y − Xθ||_μ = (1/(2μn))||y − Xθ||_2^2 if (1/√n)||y − Xθ||_2 < μ, and (1/√n)||y − Xθ||_2 − μ/2 otherwise.

We present several two-dimensional examples of the smoothed ℓ2 norm using different μ's in Figure 1. A larger μ introduces a larger approximation error, but makes the approximation smoother. We then consider the following partially smoothed optimization problem:

  θ̂ = argmin_{θ ∈ R^d} F_{μ,λ}(θ), where F_{μ,λ}(θ) = L_μ(θ) + λ||θ||_1.   (2.2)

Figure 1: Examples of ||x||_2 (μ = 0) and the smoothed norm ||x||_μ for increasing values of μ (including μ = 0.5), for x ∈ R^2.

(II) The ISTA algorithm is applied to solve (2.2) (Nesterov, 2013). Particularly, given θ^(t) at the t-th iteration, we consider the quadratic approximation of F_{μ,λ}(θ) at θ = θ^(t),

  Q_{μ,λ}(θ, θ^(t)) = L_μ(θ^(t)) + ∇L_μ(θ^(t))^T(θ − θ^(t)) + (L_t/2)||θ − θ^(t)||_2^2 + λ||θ||_1,   (2.3)

where L_t is a step size parameter determined by a backtracking line search. We then take

  θ^(t+1) = argmin_θ Q_{μ,λ}(θ, θ^(t)) = S_{λ/L_t}(θ^(t) − ∇L_μ(θ^(t))/L_t).

For simplicity, we denote θ^(t+1) = T_{L_t,λ}(θ^(t)). Given a pre-specified precision ε, we terminate the iterations when the approximate KKT condition holds:

  ω_λ(θ^(t)) = min_{g ∈ ∂||θ^(t)||_1} ||∇L_μ(θ^(t)) + λg||_∞ ≤ ε.

(III) The Pathwise Optimization component is essentially a multistage optimization scheme for boosting computational performance. We solve (2.2) using a geometrically decreasing sequence of regularization parameters λ_1 > ... > λ_N, where λ_N = λ_SQRT. This yields a sequence of output solutions θ̂^[1], ..., θ̂^[N] from sparse to dense. Particularly, at the K-th optimization stage, we choose θ̂^[K−1] (the output solution of the (K−1)-th stage) as the initial solution, i.e., θ^(0) = θ̂^[K−1], and solve (2.2) with λ = λ_K using the ISTA algorithm. This is also referred to as warm start initialization in the existing literature. We summarize our approach in Algorithm 1.

3 Computational and Statistical Analysis

We first define locally restricted strong convexity and smoothness.

Definition 3.1. Given a constant r ∈ R_+, let B_r = {θ ∈ R^d : ||θ − θ*||_2 ≤ r}. For any v, w ∈ B_r satisfying ||v − w||_0 ≤ s, L_μ is locally restricted strongly convex (LRSC) and smooth (LRSS) on B_r at sparsity level s if there exist universal constants ρ⁻_s, ρ⁺_s ∈ (0, ∞) such that

  (ρ⁻_s/2)||v − w||_2^2 ≤ L_μ(v) − L_μ(w) − ∇L_μ(w)^T(v − w) ≤ (ρ⁺_s/2)||v − w||_2^2.   (3.1)

We define the locally restricted condition number at sparsity level s as κ_s = ρ⁺_s/ρ⁻_s.

The LRSC and LRSS properties are locally constrained variants of restricted strong convexity and smoothness (Agarwal et al., 2010; Xiao and Zhang, 2013) with respect to a neighborhood of the true model parameter θ*, and they are key to establishing the strong convergence guarantees of our proposed algorithm in high dimensions. Next, we introduce two key assumptions for establishing our computational theory.

Assumption 3.2. The sequence of regularization parameters satisfies λ_N ≥ 6||∇L_μ(θ*)||_∞.

Assumption 3.2 requires that λ_N is sufficiently large such that the irrelevant variables can be eliminated (Bickel et al., 2009; Negahban et al., 2012).

Assumption 3.3. L_μ satisfies the LRSC and LRSS properties on B_r, where r ≥ 8√(s*) λ_Ñ / ρ⁻_{s*+s̃} for some Ñ < N, Ñ ∈ Z_+, and λ_Ñ > λ_N. Specifically, (3.1) holds with ρ⁺_{s*+s̃}, ρ⁻_{s*+s̃} ∈ (0, ∞), where s̃ = C_1 s* > (96κ²_{s*+s̃} + 44κ_{s*+s̃}) s*, C_1 ∈ R_+ is a constant, and κ_{s*+s̃} = ρ⁺_{s*+s̃}/ρ⁻_{s*+s̃}.

Assumption 3.3 guarantees that L_μ satisfies the LRSC and LRSS properties as long as the estimation error satisfies ||θ − θ*||_2 ≤ r and the number of irrelevant coordinates of the solution is bounded by s̃.
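Before turning to the full pathwise procedure (Algorithm 1 below), the building blocks of Section 2 translate directly into a few lines of code. The following is a minimal sketch in Python/NumPy (the paper's experiments are implemented in C, and all function names here are ours) of the smoothed loss L_μ, its gradient, the soft-thresholding operator S_λ, the proximal update T_{L,λ}, and the approximate KKT residual ω_λ, under the scaling conventions of (2.1)–(2.3).

```python
import numpy as np

def smoothed_loss(theta, X, y, mu):
    """L_mu(theta): conjugate-dual smoothing of the l2 loss (1/sqrt(n))||y - X theta||_2."""
    n = X.shape[0]
    r = (y - X @ theta) / np.sqrt(n)            # scaled residual
    nrm = np.linalg.norm(r)
    if nrm < mu:                                # quadratic (smooth) region
        return nrm ** 2 / (2.0 * mu)
    return nrm - mu / 2.0                       # linear region

def smoothed_grad(theta, X, y, mu):
    """Gradient of L_mu: -X^T r / (sqrt(n) * max(mu, ||r||_2)), with r = (y - X theta)/sqrt(n)."""
    n = X.shape[0]
    r = (y - X @ theta) / np.sqrt(n)
    return -X.T @ r / (np.sqrt(n) * max(mu, np.linalg.norm(r)))

def soft_threshold(x, lam):
    """S_lam(x): entrywise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_step(theta, X, y, mu, lam, L):
    """T_{L,lam}(theta): one ISTA update on the smoothed objective with step size 1/L."""
    return soft_threshold(theta - smoothed_grad(theta, X, y, mu) / L, lam / L)

def kkt_residual(theta, X, y, mu, lam):
    """omega_lam(theta): min over subgradients g of ||grad L_mu(theta) + lam * g||_inf."""
    g = smoothed_grad(theta, X, y, mu)
    res = np.where(theta != 0,
                   g + lam * np.sign(theta),                      # active coordinates
                   np.maximum(np.abs(g) - lam, 0.0))              # zero coordinates
    return np.max(np.abs(res))
```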

Algorithm 1: Pathwise Iterative Smoothing Shrinkage Thresholding Algorithm (PIS²TA) for solving the SQRT-Lasso optimization (1.4). θ̂^[K] denotes the output solution corresponding to λ_K; θ^(t) denotes the solution at the t-th iteration of the K-th optimization stage; ε_K is a pre-specified precision for the K-th optimization stage. The line search procedure is described in Algorithm 2.

Input: y, X, N, λ_N, ε_N, L_max > 0
Initialize: θ̂^[0] ← 0, λ_0 ← ||∇L_μ(0)||_∞, η ← (λ_N/λ_0)^{1/N}
For K = 1, ..., N:
  t ← 0, λ_K ← ηλ_{K−1}, θ^(0) ← θ̂^[K−1], L_0 ← L_max
  Repeat:
    t ← t + 1
    L̃_t ← LineSearch(λ_K, θ^(t−1), L_{t−1}), L_t ← min{L̃_t, L_max}
    θ^(t) ← T_{L_t, λ_K}(θ^(t−1))
  Until: ω_{λ_K}(θ^(t)) ≤ ε_K
  θ̂^[K] ← θ^(t)
End For
Return: θ̂^[N]

Algorithm 2: Line search of PIS²TA for SQRT-Lasso.

Input: λ_K, θ^(t), L_t
Initialize: L̃_t ← L_t
Repeat:
  θ̃^(t) ← T_{L̃_t, λ_K}(θ^(t))
  If F_{μ,λ_K}(θ̃^(t)) < Q_{μ,λ_K}(θ̃^(t), θ^(t)): L̃_t ← L̃_t/2 End If
Until: F_{μ,λ_K}(θ̃^(t)) ≥ Q_{μ,λ_K}(θ̃^(t), θ^(t))
Return: L̃_t

3.1 Computational Theory

Our analysis consists of two phases, depending on the estimation error and the sparsity of the solution θ̂^[K] along the path. Specifically, we denote B_r^{s*+s̃} = B_r ∩ {θ ∈ R^d : ||θ − θ*||_0 ≤ s* + s̃}. Let Ñ ∈ {1, ..., N} be a cut-off between Phase I and Phase II. We can show that Phase I corresponds to the first Ñ stages of pathwise optimization, in which we cannot guarantee θ̂^[K] ∈ B_r^{s*+s̃}. Thus we only establish a sublinear convergence for Phase I. But Phase I is still computationally efficient,

since we can choose reasonably large ε_K for K = 1, ..., Ñ to facilitate early stopping. Phase II corresponds to the subsequent N − Ñ stages, in which we guarantee θ̂^[K] ∈ B_r^{s*+s̃}. Thus LRSC and LRSS hold, and a linear convergence can be established accordingly.

Theorem 3.4. Suppose Assumptions 3.2 and 3.3 hold. Let θ̂^[K] be the output solution satisfying ω_{λ_K}(θ̂^[K]) ≤ ε_K at the K-th stage, for all K = 1, ..., N. We denote S = {j : θ*_j ≠ 0}, S̄ = {j : θ*_j = 0}, and s* = |S|. Recall that η is the decaying ratio of the geometrically decreasing regularization sequence. Given μ ≤ σ/4, λ_0 = ||∇L_μ(0)||_∞, and η = (λ_N/λ_0)^{1/N} ∈ (5/6, 1), there exists Ñ ∈ {1, 2, ..., N} such that the following results hold.

Phase I: Let R = max_{1 ≤ K ≤ Ñ} {sup_θ ||θ − θ*||_2 : F_{μ,λ_K}(θ) ≤ F_{μ,λ_K}(θ^(0))}. At the K-th stage, where K = 1, ..., Ñ, we need at most

  T_K = O(||X||_2^4 R / (ε_K μ^2))

iterations to guarantee ω_{λ_K}(θ^(t)) ≤ ε_K and F_{μ,λ_K}(θ̂^[K]) − F_{μ,λ_K}(θ̄^[K]) = O(||X||_2^2 R^2 / (T_K μ)), where θ̄^[K] is a global optimum of (1.4) with λ = λ_K. Moreover, we have θ̂^[Ñ] ∈ B_r and ||[θ̂^[Ñ]]_{S̄}||_0 ≤ s̃.

Phase II: Let α = 1 − 1/(8κ_{s*+s̃}). At the K-th stage, where K = Ñ+1, ..., N, the solutions are sparse throughout all iterations, i.e., ||[θ^(t)]_{S̄}||_0 ≤ s̃. Moreover, we need at most T_K = O(κ_{s*+s̃} log(κ³_{s*+s̃} s* λ_K / ε_K)) iterations to guarantee ω_{λ_K}(θ^(t)) ≤ ε_K,

  F_{μ,λ_K}(θ̂^[K]) − F_{μ,λ_K}(θ̄^[K]) = O(α^{T_K} ε_K λ_K s*), and ||θ̂^[K] − θ̄^[K]||_2^2 = O(α^{T_K} ε_K λ_K s*),

where θ̄^[K] is the unique sparse global optimum of (1.4) with λ = λ_K, satisfying ||[θ̄^[K]]_{S̄}||_0 ≤ s̃.

Theorem 3.4 guarantees that PIS²TA achieves an R-linear convergence to the unique sparse global optimum of (1.4) in terms of both the objective value and the solution parameter, which is nearly as efficient as Xiao and Zhang (2013) for solving Lasso, with much less tuning effort, since λ_N is independent of σ. This further explains why PIS²TA is much more efficient than other competing algorithms for solving SQRT-Lasso, such as ADMM and SOCP with an interior point method. A geometric interpretation of Theorem 3.4 is provided in Figure 2. The first Ñ stages serve as intermediate processes that facilitate fast convergence to θ̂^[Ñ] and do not require high precision solutions. Thus we choose ε_K = λ_K/4 ≥ ε_N for K = 1, ..., N−1 so that Phase I is efficient, as shown in Figure 3, and require high precision only for the last stage, e.g., ε_N = 10^{-5}, because only the last regularization parameter is of interest. The total number of iterations for computing the entire solution path is at most

  O(Ñ ||X||_2^4 R / (λ_N μ^2) + κ_{s*+s̃} (N − Ñ) log(κ_{s*+s̃} s* λ_N / ε_N)).

Moreover, given properly chosen μ and λ_N, we guarantee that none of the linear convergence region, the true model parameter θ*, and the output solution θ̂^[N] falls into the smooth region. This implies that (1.3) and (1.4) share the same global optimum in Phase II. A formal claim is presented in Section 3.2.
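To make the pathwise scheme analyzed above concrete, the sketch below strings the helpers from the earlier listing into the warm-started loop of Algorithm 1, using a geometric sequence of regularization parameters, loose stage precisions ε_K = λ_K/4, and a standard doubling backtracking line search standing in for Algorithm 2. It is an illustrative sketch, not the authors' implementation, and the default parameter values are assumptions.

```python
import numpy as np

def pis2ta_path(X, y, lam_target, N=50, mu=1e-4, eps_target=1e-5, L_max=1e6, max_iter=1000):
    """Pathwise iterative smoothing shrinkage thresholding (a sketch of Algorithm 1).

    Solves min_theta L_mu(theta) + lam * ||theta||_1 along a geometric sequence
    lam_0 > ... > lam_N = lam_target, warm-starting each stage at the previous solution.
    """
    n, d = X.shape
    theta = np.zeros(d)
    lam0 = np.max(np.abs(smoothed_grad(theta, X, y, mu)))   # lam_0 = ||grad L_mu(0)||_inf
    eta = (lam_target / lam0) ** (1.0 / N)                  # decay ratio of the lambda sequence
    lam, L = lam0, 1.0
    for K in range(1, N + 1):
        lam *= eta
        eps_K = eps_target if K == N else lam / 4.0         # loose precision except at the last stage
        for _ in range(max_iter):
            L = max(L / 2.0, 1e-8)                          # try a more aggressive step first
            f_old = smoothed_loss(theta, X, y, mu)
            g = smoothed_grad(theta, X, y, mu)
            while True:                                     # backtracking: double L until the model majorizes
                theta_new = soft_threshold(theta - g / L, lam / L)
                diff = theta_new - theta
                quad = f_old + g @ diff + 0.5 * L * diff @ diff
                if smoothed_loss(theta_new, X, y, mu) <= quad or L >= L_max:
                    break
                L *= 2.0
            theta = theta_new
            if kkt_residual(theta, X, y, mu, lam) <= eps_K:  # approximate KKT stopping rule
                break
    return theta
```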

Figure 2: A geometric interpretation of the two phases of convergence, showing the initial solution θ̂^[0] = 0, the Phase I output solutions θ̂^[1], ..., θ̂^[Ñ], the Phase II output solutions θ̂^[Ñ+1], ..., θ̂^[N], the neighborhood B_r^{s*+s̃} of θ*, the smoothed (overfitted model) region, the region of sublinear convergence (Phase I, yellow), and the region of linear convergence (Phase II, orange). The linear convergence region, the true parameter θ*, and the output solution θ̂^[N] do not fall into the smoothed region.

Figure 3: Plots of the objective gaps F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K]) over the iterations t of each path following stage K = 1, ..., N (only a few stages are shown for clarity).

3.2 Statistical Theory

To analyze the statistical properties of the estimator obtained via PIS²TA, we assume that the design matrix X satisfies the following restricted eigenvalue condition.

Assumption 3.5. The design matrix X satisfies the Restricted Eigenvalue (RE) condition, i.e., there exist constants ψ_min, ψ_max, φ_min, φ_max ∈ (0, ∞), which do not scale with (s*, n, d), such that

  ψ_min ||v||_2^2 − φ_min (log d / n) ||v||_1^2 ≤ (1/n)||Xv||_2^2 ≤ ψ_max ||v||_2^2 + φ_max (log d / n) ||v||_1^2.   (3.2)

A wide family of examples satisfies the RE condition, such as the correlated sub-Gaussian random design (Rudelson and Zhou, 2013), which has been extensively studied for sparse recovery (Candes and Tao, 2005; Bickel et al., 2009; Raskutti et al., 2010). We next verify Assumptions 3.2 and 3.3 based on the RE condition in the following lemma.

Lemma 3.6. Suppose Assumption 3.5 holds. Given μ ≤ σ/4 and λ_N = 4√(log d / n), we have λ_N ≥ 6||∇L_μ(θ*)||_∞ with high probability. Moreover, for large enough n, L_μ(θ) satisfies the LRSC and LRSS properties on B_r with high probability, where r = σ/(8√ψ_max). Specifically, (3.1) holds with ρ⁺_{s*+s̃} ≤ 8ψ_max/σ and ρ⁻_{s*+s̃} ≥ ψ_min/(8σ), where s̃ = C_2 s* > (96κ²_{s*+s̃} + 44κ_{s*+s̃}) s*, C_2 ∈ R_+ is a generic constant, and κ_{s*+s̃} ≤ 64ψ_max/ψ_min.

Lemma 3.6 guarantees that Assumption 3.2 holds given properly chosen μ and λ_N, and that Assumption 3.3 holds given a design X satisfying the RE condition, both with high probability. Therefore, by Theorem 3.4, PIS²TA achieves an R-linear convergence to the unique sparse global optimum. In the next theorem, we characterize the statistical rate of convergence of PIS²TA.

Theorem 3.7. Suppose Assumption 3.5 holds. Let the output solution θ̂^[N] satisfy ω_{λ_N}(θ̂^[N]) ≤ ε_N for a small enough ε_N. Given μ ≤ σ/4, λ_N = 4√(log d / n), and a large enough n, we have

  ||θ̂^[N] − θ*||_2 = O_P(σ√(s* log d / n)) and ||θ̂^[N] − θ*||_1 = O_P(σ s* √(log d / n)).

Moreover, let σ̂ = (1/√n)||y − Xθ̂^[N]||_2 be the estimator of σ. Then we have |σ̂ − σ| = O_P(σ s* log d / n).

Theorem 3.7 guarantees that the output solution θ̂^[N] obtained by PIS²TA achieves the minimax optimal rate of convergence in parameter estimation (Ye and Zhang, 2010; Raskutti et al., 2011). The next proposition shows that (1.3) and (1.4) share the same global optimum, which corresponds to a well fitted model. This implies that neither the unique sparse global optimum θ̄^[N] nor the linear convergence region falls into the smooth region, as shown in Figure 2.

Proposition 3.8. Under the same assumptions as Theorem 3.7, for all λ_K's with K = Ñ+1, ..., N, (1.3) and (1.4) share the same unique sparse global optimum with high probability.

4 Extension to Sparse Precision Matrix Estimation

We consider the TIGER approach proposed in Liu and Wang (2012) for estimating a sparse precision matrix. To be clear, we emphasize that in this section we use v_1, v_2, ... (without brackets) as subscripts to index vectors. Let X = [x_1, ..., x_n]^T ∈ R^{n×d} contain n observed data points from a d-variate Gaussian distribution N_d(0, Σ*). Our goal is to estimate the sparse precision matrix Θ* = (Σ*)^{-1}. Let Z = XΓ̂^{-1/2} = [z_1, ..., z_d] ∈ R^{n×d} be the standardized data matrix, where Γ̂ = diag(Σ̂_{11}, ..., Σ̂_{dd}) is a diagonal matrix and Σ̂ = (1/n)X^T X. Then we can write

  z_i = Z_{*,\i} θ*_i + Γ̂_{ii}^{-1/2} ε_i for all i = 1, ..., d, where θ*_i = Γ̂_{ii}^{-1/2} Γ̂_{\i,\i}^{1/2} (Σ*_{\i,\i})^{-1} Σ*_{\i,i},

and ε_i ~ N_n(0, σ_i^2 I) with σ_i^2 = Σ*_{ii} − Σ*_{i,\i}(Σ*_{\i,\i})^{-1}Σ*_{\i,i}. We denote τ_i^2 = σ_i^2/Σ*_{ii}, and solve

  θ̂_i = argmin_{θ_i ∈ R^{d−1}} L_{μ,i}(θ_i) + λ||θ_i||_1 and τ̂_i = (1/√n)||z_i − Z_{*,\i}θ̂_i||_μ,   (4.1)

for all i = 1, ..., d, where L_{μ,i}(θ_i) = (1/√n)||z_i − Z_{*,\i}θ_i||_μ. The i-th column of the precision matrix is then estimated by

  Θ̂_{ii} = τ̂_i^{-2} Γ̂_{ii}^{-1} and Θ̂_{\i,i} = −τ̂_i^{-2} Γ̂_{ii}^{-1/2} Γ̂_{\i,\i}^{-1/2} θ̂_i.

Here we solve (4.1) by PIS²TA. We then introduce a few mild technical assumptions.

Assumption 4.1. Suppose the true covariance matrix Σ* and precision matrix Θ* satisfy:
(A1) Θ* ∈ M(κ_Θ, s*) = {Θ ∈ R^{d×d} : Θ ≻ 0, Λ_max(Θ)/Λ_min(Θ) ≤ κ_Θ, max_i Σ_{j≠i} 1(Θ_{ij} ≠ 0) ≤ s*};
(A2) s* log d = o(n);
(A3) lim sup_{n→∞} max_i Σ*_{ii} √(log d / n) < 1/4.

We first verify the assumptions required by our computational theory in the following lemma.

Lemma 4.2. Suppose Assumption 4.1 holds. Given μ ≤ 1/(5√(5κ_Θ)) and λ_N = 6√(log d / n), we have λ_N ≥ 6 max_{i ∈ {1,...,d}} ||∇L_{μ,i}(θ*_i)||_∞ with high probability. Moreover, L_{μ,i}(θ_i) satisfies the LRSC and LRSS properties on B_{r_i} for all i = 1, ..., d with high probability, where r_i ≍ σ_i/√κ_Θ. Specifically, for all i = 1, ..., d, (3.1) holds with ρ⁺_{s*+s̃} = O(κ_Θ/σ_i) and ρ⁻_{s*+s̃} = Ω(1/(κ_Θ σ_i)), where s̃ = C_3 s* > (96κ²_{s*+s̃} + 44κ_{s*+s̃}) s* for a generic constant C_3, and κ_{s*+s̃} ≤ 44κ_Θ².

Lemma 4.2 guarantees that Assumptions 3.2 and 3.3 hold with high probability given properly chosen μ and λ_N. Thus, by Theorem 3.4, PIS²TA achieves an R-linear convergence to the unique sparse global optimum of (4.1) with high probability for every column of Θ*. The next theorem characterizes the statistical rate of convergence of the precision matrix estimator obtained by PIS²TA.

Theorem 4.3. Suppose Assumption 4.1 holds. Let Θ̂^[N] be the output solution for the regularization parameter λ_N. Given μ and λ_N = 6√(log d / n) as in Lemma 4.2, we have ||Θ̂^[N] − Θ*||_2 = O_P(s* ||Θ*||_2 √(log d / n)).

Theorem 4.3 implies that our precision matrix estimator attains the minimax optimal rate of convergence in parameter estimation. Moreover, we guarantee that neither the linear convergence region nor the output solution Θ̂^[N] falls into the smooth region with high probability.

5 Numerical Experiments

We investigate the computational performance of the proposed PIS²TA algorithm through numerical experiments on both simulated and real data examples. All simulations are implemented in C with double precision using a PC with an Intel 3.3GHz Core i5 CPU and 6GB memory. For simulated data, we generate a training dataset of 200 samples, where each row X_{i*}, i = 1, ..., 200, of the design matrix is sampled independently from a 2000-dimensional normal distribution N(0, Σ) with Σ_{jj} = 1 and Σ_{jk} = 0.5 for all k ≠ j. We set s* = 3 with nonzero entries θ*_1 = 3, θ*_2, and θ*_4 = 1.5, and θ*_j = 0 for all j ∉ {1, 2, 4}. A validation set of 200 samples for the regularization parameter selection and a testing set of 10,000 samples are also generated to evaluate the prediction accuracy.
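A sketch of this simulated-data setup is given below, assuming the equi-correlated covariance Σ_jj = 1, Σ_jk = 0.5 described above; the sample sizes and the nonzero coefficient values in the code are placeholders in the spirit of the description, not necessarily the exact values used in the experiments.

```python
import numpy as np

def make_data(n=200, d=2000, rho=0.5, sigma=1.0, seed=0):
    """Equi-correlated Gaussian design: Sigma_jj = 1, Sigma_jk = rho for j != k.

    A draw from N(0, Sigma) with this Sigma can be generated without forming the d x d
    covariance: x = sqrt(rho) * z0 + sqrt(1 - rho) * z, with a shared z0 ~ N(0, 1) and
    independent z ~ N(0, I_d).
    """
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal((n, 1))
    z = rng.standard_normal((n, d))
    X = np.sqrt(rho) * z0 + np.sqrt(1.0 - rho) * z
    theta_star = np.zeros(d)
    theta_star[[0, 1, 3]] = [3.0, 2.0, 1.5]      # placeholder values for theta*_1, theta*_2, theta*_4
    y = X @ theta_star + sigma * rng.standard_normal(n)
    return X, y, theta_star
```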

We set σ = 0.5, 1, 2, and 4 respectively to illustrate the tuning insensitivity. The regularization parameter of both Lasso and SQRT-Lasso is chosen over a geometrically decreasing sequence {λ_K}_{K=0}^{50}, with λ_50 ≍ σ√(log d / n) for Lasso and λ_50 ≍ √(log d / n) for SQRT-Lasso. The optimal regularization parameter is determined by λ_opt = λ_N̂ with N̂ = argmin_{K ∈ {0,...,50}} ||ỹ − X̃θ̂^[K]||_2, where θ̂^[K] denotes the estimate obtained using the regularization parameter λ_K, and ỹ and X̃ denote the response vector and design matrix of the validation set. For both Lasso and SQRT-Lasso, we set the stopping precision ε_K = 10^{-5} for all K = 1, ..., 50. For SQRT-Lasso, we set the smoothing parameter μ = 10^{-4}.

First of all, we compare PIS²TA with the ADMM algorithm proposed in Li et al. (2015). The backtracking line search described in Algorithm 2 is adopted to accelerate both algorithms. We conduct 500 simulations for each value of σ. The results are presented in Table 1. The PIS²TA and ADMM algorithms attain similar objective values, but PIS²TA is about 10 times faster than ADMM. Both algorithms also achieve similar estimation errors. Throughout all 500 simulations, the residual satisfies (1/√n)||y − Xθ̂^[N̂]||_2 > μ. This implies that all obtained optimal estimators are outside the smoothed region of the optimization problem, i.e., the smoothing approach does not hurt the solution accuracy.

Next, we compare the computational and statistical performance of Lasso solved by PISTA (Xiao and Zhang, 2013) and SQRT-Lasso solved by PIS²TA. The results averaged over 500 simulations are summarized in Table 1. In terms of statistical performance, Lasso and SQRT-Lasso attain similar estimation and prediction errors. In terms of computational performance, the PIS²TA algorithm for solving SQRT-Lasso is as efficient as PISTA for Lasso, which matches our computational analysis. Moreover, we also examine the optimal regularization parameters for Lasso and SQRT-Lasso. We visualize the distribution of all 500 selected λ_opt's using a kernel density estimator. Particularly, we adopt the Gaussian kernel, and the kernel bandwidth is selected based on 10-fold cross validation. Figure 4 illustrates the estimated density functions. The horizontal axis corresponds to the rescaled regularization parameter λ_opt/√(log d / n). We see that the optimal regularization parameters of Lasso vary significantly with different σ. In contrast, the optimal regularization parameters of SQRT-Lasso are much more concentrated. This is consistent with the claimed tuning insensitivity.

Finally, we compare PIS²TA with ADMM on real data sets for precision matrix estimation. Particularly, we use four real-world biology data sets preprocessed by Li and Toh (2010): Estrogen (d = 692), Arabidopsis (d = 834), Leukemia (d = 1,255), and Hereditary (d = 1,869). We set three different values for λ_N such that the obtained estimators achieve different levels of sparse recovery. We set N = 50 and ε_K = 10^{-4} for all K's. The timing performance is summarized in Table 2. As can be seen, PIS²TA is 5 to 10 times faster than ADMM on all four data sets.

Footnote 2: We do not have any results for the algorithm proposed in Belloni et al. (2011), because it failed to finish the 500 simulations after hours of computation. The implementation was based on SDPT3.
Footnote 3: We do not have any results for the algorithm proposed in Belloni et al. (2011), because it failed to finish the experiments on all four data sets after hours of computation. The implementation was also based on SDPT3.
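For the precision matrix experiments, each column of Θ̂ is obtained from one smoothed SQRT-Lasso regression and rescaled as in Section 4. The sketch below assembles the estimator column by column, reusing the pathwise solver from the earlier sketch; it follows our reading of (4.1), uses the unsmoothed residual norm for τ̂_i, and symmetrizes at the end, so the constants and conventions may differ from the authors' code.

```python
import numpy as np

def tiger_precision(X, lam, mu=1e-4):
    """Column-wise sparse precision matrix estimation in the spirit of Section 4 (TIGER).

    Each column i is obtained by regressing the standardized i-th variable on the others
    with the smoothed SQRT-Lasso solver, then rescaling by the estimated residual scale.
    """
    n, d = X.shape
    Sigma_hat = X.T @ X / n
    Gamma = np.diag(Sigma_hat)                       # marginal variances Sigma_hat_ii
    Z = X / np.sqrt(Gamma)                           # standardized data matrix
    Theta = np.zeros((d, d))
    for i in range(d):
        idx = np.delete(np.arange(d), i)
        theta_i = pis2ta_path(Z[:, idx], Z[:, i], lam, mu=mu)
        resid = Z[:, i] - Z[:, idx] @ theta_i
        tau2 = max(np.linalg.norm(resid) ** 2 / n, 1e-12)   # tau_i^2, kept away from zero
        Theta[i, i] = 1.0 / (tau2 * Gamma[i])
        Theta[idx, i] = -theta_i / (tau2 * np.sqrt(Gamma[i]) * np.sqrt(Gamma[idx]))
    return (Theta + Theta.T) / 2.0                   # symmetrize the column-wise estimates
```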

Table 1: Quantitative comparison between Lasso (solved by PISTA) and SQRT-Lasso (solved by PIS²TA and ADMM) over 500 simulations. The estimation error is defined as ||θ̂ − θ*||_2. The prediction error is defined as ||ỹ − X̃θ̂^[N̂]||_2/√n. The residual is defined as (1/√n)||y − Xθ̂^[N̂]||_2 (for PIS²TA only). PIS²TA attains nearly the same estimation and prediction errors as ADMM, but is significantly faster than ADMM over all settings. Besides, all obtained estimators are outside the smoothed region throughout all 500 simulations. (Columns: variance of noise; estimation error for Lasso and PIS²TA; prediction error for Lasso and PIS²TA; time in seconds for Lasso, PIS²TA, and ADMM; residual for PIS²TA. Rows: σ = 0.5, 1, 2, 4. Numerical entries omitted.)

Table 2: Timing comparison between PIS²TA and ADMM on the biology data under different levels of sparsity recovery. PIS²TA is significantly faster than ADMM over all settings and data sets. (Columns: Estrogen, Arabidopsis, Leukemia, and Hereditary, each with PIS²TA and ADMM timings. Rows: three sparsity levels, including 3% and 10%. Numerical entries omitted.)

Figure 4: Estimated distributions of λ_opt/√(log d / n) over different values of σ (σ = 0.5, 1, 2, 4) for (a) Lasso and (b) SQRT-Lasso (PIS²TA).

References

Agarwal, A., Negahban, S. and Wainwright, M. J. (2010). Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems.

Beck, A. and Teboulle, M. (2012). Smoothing and first order methods: A unified framework. SIAM Journal on Optimization.

Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika.

Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.

Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory.

Johnstone, I. M. (2001). Chi-square oracle inequalities. Lecture Notes–Monograph Series.

Li, L. and Toh, K.-C. (2010). An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation.

Li, X., Zhao, T., Yuan, X. and Liu, H. (2015). The flare package for high dimensional linear regression and precision matrix estimation in R. The Journal of Machine Learning Research.

Liu, H. and Wang, L. (2012). TIGER: A tuning-insensitive approach for optimally estimating Gaussian graphical models. Technical report, Massachusetts Institute of Technology.

Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming.

Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming.

Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research.

Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory.

Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B.

Wainwright, M. J. (2015). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. In preparation, University of California, Berkeley.

Xiao, L. and Zhang, T. (2013). A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization.

Ye, F. and Zhang, C.-H. (2010). Rate minimaxity of the lasso and Dantzig selector for the ℓq loss in ℓr balls. The Journal of Machine Learning Research.

Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics.

Zhang, T. (2009). Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics.

A Intermediate Results for Theorem 3.4 and Theorem 3.7

We first introduce some important implications of the proposed assumptions. Recall that S = {j : θ*_j ≠ 0} is the index set of nonzero entries of θ* with s* = |S|, and S̄ = {j : θ*_j = 0} is its complement. From Lemma 3.6, Assumption 3.3 implies RSC and RSS with parameters ρ⁻_{s*+s̃} and ρ⁺_{s*+s̃} respectively. By Nesterov (2004), the following conditions are equivalent to RSC and RSS: for any v, w ∈ R^d satisfying ||v − w||_0 ≤ s* + s̃,

  (ρ⁻_{s*+s̃}/2)||v − w||_2^2 ≤ L_μ(v) − L_μ(w) − ∇L_μ(w)^T(v − w) ≤ (ρ⁺_{s*+s̃}/2)||v − w||_2^2,   (A.1)
  ρ⁻_{s*+s̃}||v − w||_2^2 ≤ (∇L_μ(v) − ∇L_μ(w))^T(v − w) ≤ ρ⁺_{s*+s̃}||v − w||_2^2.   (A.2)

From the convexity of the ℓ1 norm, we have

  ||v||_1 − ||w||_1 ≥ (v − w)^T g,   (A.3)

where g ∈ ∂||w||_1. Combining (A.1) and (A.3), we have, for any v, w ∈ R^d satisfying ||v − w||_0 ≤ s* + s̃,

  F_{μ,λ}(v) − F_{μ,λ}(w) − (v − w)^T ξ ≥ (ρ⁻_{s*+s̃}/2)||v − w||_2^2, where ξ ∈ ∂F_{μ,λ}(w).   (A.4)

Remark A.1. For any t and K, the line search satisfies L_t ≤ L̃_t ≤ min{2L_μ, L_max} and, restricted to sparse iterates, L_t ≤ L̃_t ≤ 2ρ⁺_{s*+s̃},   (A.5)
where L_μ = min{L : ||∇L_μ(v) − ∇L_μ(w)||_2 ≤ L||v − w||_2 for all v, w ∈ R^d} is the Lipschitz constant of ∇L_μ.

We first show that when θ is sparse and the approximate KKT condition is satisfied, both the ℓ1 estimation error and the objective error with respect to the true model parameter are bounded. This characterizes the fact that the initial value θ^(0) of the K-th path following stage has desirable statistical properties if we initialize θ^(0) = θ̂^[K−1]. This is formalized in Lemma A.2, and its proof is deferred to Appendix N.1.

Lemma A.2. Suppose Assumptions 3.2 and 3.3 hold, and λ ≥ λ_N. If θ satisfies ||θ_{S̄}||_0 ≤ s̃ and the approximate KKT condition

  min_{g ∈ ∂||θ||_1} ||∇L_μ(θ) + λg||_∞ ≤ λ/2,   (A.6)

then we have

  ||(θ − θ*)_{S̄}||_1 ≤ 5||(θ − θ*)_S||_1,   (A.7)
  ||θ − θ*||_2 ≤ λ√(s*)/ρ⁻_{s*+s̃},   (A.8)
  ||θ − θ*||_1 ≤ λ s*/ρ⁻_{s*+s̃},   (A.9)
  F_{μ,λ}(θ) − F_{μ,λ}(θ*) ≤ 6λ²s*/ρ⁻_{s*+s̃}.   (A.10)

Next, we show that if θ is sparse and the objective error is bounded, then the estimation error is also bounded. This characterizes the fact that, within the K-th path following stage, good statistical performance is preserved after each proximal-gradient update. This is formalized in Lemma A.3, and its proof is deferred to Appendix N.2.

Lemma A.3. Suppose Assumptions 3.2 and 3.3 hold, and λ ≥ λ_N. If θ satisfies ||θ_{S̄}||_0 ≤ s̃ and the objective satisfies F_{μ,λ}(θ) − F_{μ,λ}(θ*) ≤ 6λ²s*/ρ⁻_{s*+s̃}, then we have

  ||θ − θ*||_2 ≤ 4λ√(3s*)/ρ⁻_{s*+s̃},   (A.11)
  ||θ − θ*||_1 ≤ 4λ s*/ρ⁻_{s*+s̃}.   (A.12)

We then show that if θ is sparse and the objective error is bounded, then the proximal-gradient update keeps the solution sparse. This demonstrates that, within the K-th path following stage, each update θ^(t) remains sparse and retains good statistical performance. This is formalized in Lemma A.4, and its proof is deferred to Appendix N.3.

Lemma A.4. Suppose Assumptions 3.2 and 3.3 hold, and λ ≥ λ_N. If θ satisfies ||θ_{S̄}||_0 ≤ s̃, L satisfies L < ρ⁺_{s*+s̃}, and the objective satisfies F_{μ,λ}(θ) − F_{μ,λ}(θ*) ≤ 6λ²s*/ρ⁻_{s*+s̃}, then we have

  ||[T_{L,λ}(θ)]_{S̄}||_0 ≤ s̃.   (A.13)

Moreover, we show that if θ satisfies the approximate KKT condition, then the objective error is bounded in terms of the regularization parameter λ. This characterizes the geometric decrease of the objective error when we choose a geometrically decreasing sequence of regularization parameters. This is formalized in Lemma A.5, and its proof is deferred to Appendix N.4.

Lemma A.5. Suppose Assumptions 3.2 and 3.3 hold, and λ ≥ λ_N. Suppose θ satisfies ω_λ(θ) ≤ λ/2. For any λ̄ ∈ [λ_N, λ], let θ̄ = argmin_θ F_{μ,λ̄}(θ). Then we have

  F_{μ,λ̄}(θ) − F_{μ,λ̄}(θ̄) ≤ (λ + λ̄)(ω_λ(θ) + λ + λ̄) s*/ρ⁻_{s*+s̃}.

Furthermore, we show that each path following stage enjoys a local linear convergence rate if the initial value θ^(0) is sparse and satisfies the approximate KKT condition with adequate precision; moreover, the estimate after each proximal gradient update remains sparse. This is the key result in demonstrating the overall geometric convergence rate of the algorithm. This is formalized in Lemma A.6, and its proof is deferred to Appendix N.5.

Lemma A.6. Suppose Assumption 3.3 holds. If the initialization θ^(0) of a stage with regularization parameter λ in Algorithm 1 satisfies ||θ^(0)_{S̄}||_0 ≤ s̃, then for any t = 1, 2, ..., we have ||θ^(t)_{S̄}||_0 ≤ s̃ and

  F_{μ,λ}(θ^(t)) − F_{μ,λ}(θ̄) ≤ (1 − 1/(8κ_{s*+s̃}))^t (F_{μ,λ}(θ^(0)) − F_{μ,λ}(θ̄)),

where θ̄ = argmin_θ F_{μ,λ}(θ).

In addition, we provide a sublinear convergence rate when RSC does not hold, based on a refined analysis of the convergence rate of the proximal gradient method with line search for convex (but not strongly convex) objectives (Nesterov, 2013). Specifically, we provide a sublinear rate of convergence without the need to quantify the distance of the initial objective to the optimal objective. This characterizes the convergence behavior when ||X(θ − θ*)||_2 is large. We formalize this in Lemma A.7, and provide the proof in Appendix N.6.

Lemma A.7 (Refined result of Theorem 4 in Nesterov (2013)). Given the initialization θ^(0), let R be such that ||θ − θ̄||_2 ≤ R for any θ ∈ R^d satisfying F_{μ,λ}(θ) ≤ F_{μ,λ}(θ^(0)). Then for any t = 1, 2, ..., we have

  F_{μ,λ}(θ^(t)) − F_{μ,λ}(θ̄) ≤ 4||X||_2^2 R^2/((t + 1)μ),   (A.14)

where θ̄ = argmin_θ F_{μ,λ}(θ).

Finally, we introduce two results characterizing the proximal gradient mapping, adapted from Nesterov (2013) and Xiao and Zhang (2013) without proof. The first lemma describes the sufficient descent of the objective achieved by the proximal gradient method.

Lemma A.8 (Adapted from Theorem 1 in Nesterov (2013)). For any L > 0,

  Q_{μ,λ}(T_{L,λ}(θ), θ) ≤ F_{μ,λ}(θ) − (L/2)||T_{L,λ}(θ) − θ||_2^2.

Besides, if L_μ(θ) is convex, we have

  Q_{μ,λ}(T_{L,λ}(θ), θ) ≤ min_x {F_{μ,λ}(x) + (L/2)||x − θ||_2^2}.   (A.15)

Further, for any L ≥ L_μ, we have

  F_{μ,λ}(T_{L,λ}(θ)) ≤ Q_{μ,λ}(T_{L,λ}(θ), θ) ≤ F_{μ,λ}(θ) − (L/2)||T_{L,λ}(θ) − θ||_2^2.   (A.16)

The next lemma provides an upper bound on the optimal residual ω_λ.

Lemma A.9 (Adapted from Lemma 2 in Xiao and Zhang (2013)). For any L > 0, if L_μ is the Lipschitz constant of ∇L_μ, then

  ω_λ(T_{L,λ}(θ)) ≤ (L + S_L(θ))||T_{L,λ}(θ) − θ||_2 ≤ (L + L_μ)||T_{L,λ}(θ) − θ||_2,

where S_L(θ) = ||∇L_μ(T_{L,λ}(θ)) − ∇L_μ(θ)||_2/||T_{L,λ}(θ) − θ||_2 is a local Lipschitz constant, which satisfies S_L(θ) ≤ L_μ.

B Proof of Theorem 3.4

We first demonstrate the sublinear rate for the initial stages, in which the estimation error ||θ − θ*||_2 can be large due to the large regularization parameters λ_K. The proof is provided in Appendix F.

Theorem B.1. For any K = 1, ..., Ñ and λ_K > 0, let θ̄^[K] = argmin_θ F_{μ,λ_K}(θ) be the optimal solution of the K-th stage with regularization parameter λ_K, and let R be such that ||θ − θ̄^[K]||_2 ≤ R for any θ ∈ R^d satisfying F_{μ,λ_K}(θ) ≤ F_{μ,λ_K}(θ^(0)). If the initial value θ^(0) satisfies ω_{λ_{K−1}}(θ^(0)) ≤ λ_{K−1}/2, then within the K-th stage, for any t = 1, 2, ..., we have

  F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K]) ≤ 4||X||_2^2 R^2/((t + 1)μ).   (B.1)

To achieve the approximate KKT condition ω_{λ_K}(θ^(t)) ≤ ε_K, the number of proximal gradient steps is no more than

  (288 ||X||_2^6 R^2 L_μ/(ε_K^2 μ^3))^{1/2} − 3.   (B.2)

Note that L_μ ≤ ||X||_2^2/μ. Then (B.2) can be simplified as O(||X||_2^4 R/(ε_K μ^2)).

Next, we demonstrate the linear rate when the estimate satisfies θ ∈ B_r^{s*+s̃}. The proof is provided in Appendix G.

Theorem B.2. Suppose Assumptions 3.2 and 3.3 hold, and λ_K > 0 for any K = Ñ, ..., N. Let θ̄^[K] = argmin_θ F_{μ,λ_K}(θ) be the optimal solution of the K-th stage with regularization

parameter λ_K. If the initial value θ^(0) satisfies ω_{λ_K}(θ^(0)) ≤ λ_K/2 with ||θ^(0)_{S̄}||_0 ≤ s̃, then within the K-th stage, for any t = 1, 2, ..., we have ||θ^(t)_{S̄}||_0 ≤ s̃,

  F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K]) ≤ (1 − 1/(8κ_{s*+s̃}))^t · 4λ_K√(s*) ω_{λ_K}(θ^(0))/ρ⁻_{s*+s̃}, and
  ||θ^(t) − θ̄^[K]||_2^2 ≤ (1 − 1/(8κ_{s*+s̃}))^t · 4λ_K√(s*) ω_{λ_K}(θ^(0))/ρ⁻_{s*+s̃}.   (B.3)

For K = Ñ, ..., N−1, to achieve the approximate KKT condition ω_{λ_K}(θ^(t)) ≤ λ_K/4, the number of proximal gradient steps is no more than

  log(κ²_{s*+s̃} s*)/log(8κ_{s*+s̃}/(8κ_{s*+s̃} − 1)).   (B.4)

For K = N, to achieve the approximate KKT condition ω_{λ_N}(θ^(t)) ≤ ε_N, the number of proximal gradient steps is no more than

  log(96(1 + κ_{s*+s̃}) λ_N√(s*) κ_{s*+s̃}/ε_N)/log(8κ_{s*+s̃}/(8κ_{s*+s̃} − 1)).   (B.5)

From basic inequalities, since κ_{s*+s̃} ≥ 1, we have

  log(8κ_{s*+s̃}/(8κ_{s*+s̃} − 1)) = −log(1 − 1/(8κ_{s*+s̃})) ≥ 1/(8κ_{s*+s̃}).

Then (B.4) and (B.5) can be simplified as O(κ_{s*+s̃} log(κ³_{s*+s̃} s*)) and O(κ_{s*+s̃} log(κ³_{s*+s̃} λ_N √(s*)/ε_N)) respectively.

As can be seen from Theorem B.2, when the initial value θ^(0) satisfies the approximate KKT condition ω_{λ_K}(θ^(0)) ≤ λ_K/2 with θ^(0) ∈ B_r^{s*+s̃}, we can guarantee the geometric convergence of the objective value towards the minimal objective. Next, we show that if the output solution θ̂^[K−1] of the (K−1)-th path following stage satisfies the approximate KKT condition, and the regularization parameter λ_K of the K-th stage is chosen properly, then θ̂^[K−1] satisfies the approximate KKT condition for λ_K with a slightly larger bound. This characterizes the fact that good computational properties are preserved by using the warm start θ^(0) = θ̂^[K−1] and a geometric sequence of regularization parameters λ_K. We formalize this notion in Lemma B.3, and its proof is deferred to Appendix H.

Lemma B.3. Let θ̂^[K−1] be the approximate solution of the (K−1)-th path following stage, which satisfies the approximate KKT condition ω_{λ_{K−1}}(θ̂^[K−1]) ≤ λ_{K−1}/4. Then we have

  ω_{λ_K}(θ̂^[K−1]) ≤ λ_K/2,

where λ_K = ηλ_{K−1} with η ∈ (5/6, 1).

Combining Theorem B.2 and Lemma B.3, we achieve the global convergence in terms of the objective value using the path following proximal gradient method, and we obtain the bounds on the numbers of iterations T_K in Phase II directly from (B.4) and (B.5) of Theorem B.2. We then obtain the objective gap of the N-th stage by an analogous argument: in the N-th (final) path following stage, once the number of proximal gradient iterations is large enough that ω_{λ_N}(θ^(t)) ≤ ε_N holds, the result follows from Lemma A.5 with λ = λ̄ = λ_N. Finally, we need to show that there exists some Ñ ∈ {1, ..., N} such that θ̂^[Ñ] ∈ B_r^{s*+s̃}. We demonstrate this result in Lemma B.4 and provide its proof in Appendix I.

Lemma B.4. Suppose Assumptions 3.2 and 3.3 hold, and the approximate KKT condition ω_λ(θ) ≤ λ/4 is satisfied. If μ ≤ σ/4 and ρ⁻_{s*+s̃} r/(8√(s*)) ≥ λ > λ_N, then we have ||θ − θ*||_2 ≤ r and ||θ_{S̄}||_0 ≤ s̃.

Lemma B.4 guarantees that there exists some Ñ < N such that, for some λ_Ñ > λ_N, the approximate solution θ̂^[Ñ] satisfies the approximate KKT condition and θ̂^[Ñ] ∈ B_r^{s*+s̃}; we then enter Phase II of strong linear convergence by Theorem B.2. This finishes the proof. We further provide a bound on the total number of proximal gradient steps for each λ_K in the following lemma for interested readers. The proof is provided in Appendix J.

Lemma B.5. For each λ_K, K = 1, ..., N, if we restart the line search with a large enough L_max, then the total number of proximal gradient steps is no more than T_K + 1 + max{log₂ L_max − log₂ ρ⁺_{s*+s̃}, 0}.

C Proof of Lemma 3.6

Part 1. We first show that Assumption 3.2 holds. By y = Xθ* + ε and (C.4), we have

  ∇L_μ(θ*) = X^T(Xθ* − y)/(√n max{√n μ, ||y − Xθ*||_2}) = −X^T ε/(√n max{√n μ, ||ε||_2}).   (C.1)

Since ε has i.i.d. sub-Gaussian entries with E[ε_i] = 0 and E[ε_i^2] = σ^2 for all i = 1, ..., n, we have from Wainwright (2015) that

  P[||ε||_2 ≤ σ√n/4] ≤ exp(−c₁n).   (C.2)

By Negahban et al. (2012), we have the following result.

Lemma C.1. Assume X satisfies ||X_{*j}||_2 ≤ √n for all j = 1, ..., d, and ε has i.i.d. zero-mean sub-Gaussian entries with E[ε_i^2] = σ^2 for all i = 1, ..., n. Then we have

  P[(1/n)||X^T ε||_∞ ≥ σ√(log d/n)] ≤ 2/d.

Combining (C.1), (C.2) and Lemma C.1, we have, with probability at least 1 − 2/d − exp(−c₁n),

  ||∇L_μ(θ*)||_∞ ≤ 4√(log d/n)/max{4μ/σ, 3}.

Part 2. Next, we show that Assumption 3.3 holds. We divide the proof into two steps.

Step 1. Suppose X satisfies the RE condition, i.e.,

  ψ_min||v||_2^2 − φ_min(log d/n)||v||_1^2 ≤ (1/n)||Xv||_2^2 ≤ ψ_max||v||_2^2 + φ_max(log d/n)||v||_1^2.

Denote s̄ = s* + s̃. Since ||v||_0 ≤ s̄ implies ||v||_1 ≤ √(s̄)||v||_2, we have

  (ψ_min − φ_min s̄ log d/n)||v||_2^2 ≤ (1/n)||Xv||_2^2 ≤ (ψ_max + φ_max s̄ log d/n)||v||_2^2.

Then there exists a universal constant c₂ such that if n ≥ c₂ s̄ log d, we have

  (ψ_min/2)||v||_2^2 ≤ (1/n)||Xv||_2^2 ≤ 2ψ_max||v||_2^2.   (C.3)

Step 2. Conditioning on (C.3), we show that L_μ satisfies LRSC and LRSS with high probability. The gradient of L_μ(θ) is

  ∇L_μ(θ) = X^T(Xθ − y)/(√n max{√n μ, ||y − Xθ||_2}).   (C.4)

The Hessian of L_μ(θ) is

  ∇²L_μ(θ) = X^T X/(nμ) if (1/√n)||y − Xθ||_2 < μ, and ∇²L_μ(θ) = X^T(I − (y − Xθ)(y − Xθ)^T/||y − Xθ||_2^2)X/(√n||y − Xθ||_2) otherwise.   (C.5)

For notational convenience, define Δ = v − w for any v, w ∈ B_r with ||Δ||_0 ≤ s̄, and denote the residual of the first order Taylor expansion as δL_μ(w + Δ, w) = L_μ(w + Δ) − L_μ(w) − ∇L_μ(w)^T Δ. Using the first order Taylor expansion of L_μ(θ) at w and the Hessian of L_μ(θ) in (C.5), the mean value theorem implies that there exists some α ∈ [0, 1] such that

  δL_μ(w + Δ, w) = Δ^T X^T X Δ/(2nμ) if ||ξ||_2 < √n μ, and δL_μ(w + Δ, w) = Δ^T X^T(I − ξξ^T/||ξ||_2^2)XΔ/(2√n||ξ||_2) otherwise,

where ξ = y − X(w + αΔ). For notational simplicity, denote ż = X(v − θ*) and z̈ = X(w − θ*), which can be viewed as two fixed vectors in R^n. Without loss of generality, assume ||ż||_2 ≥ ||z̈||_2. Then we have

  max{||ż||_2, ||z̈||_2} ≤ √(2ψ_max n) · r ≤ σ√n/4.

Further, we have ξ = y − X(w + αΔ) = ε − X(w + αΔ − θ*) = ε − αż − (1 − α)z̈, and XΔ = ż − z̈. We have from Wainwright (2015) that

  P[| ||ε||_2/√n − σ | ≥ σδ] ≤ 2exp(−c₃nδ²).   (C.6)

Then, taking δ = 1/3 in (C.6), we have, with probability at least 1 − 2exp(−c₃n/9),

  ||ξ||_2 ≥ ||ε||_2 − α||ż||_2 − (1 − α)||z̈||_2 ≥ ||ε||_2 − σ√n/4 ≥ (2/3)σ√n − σ√n/4 ≥ σ√n/4.   (C.7)

We first discuss the RSS property. From (C.7) and μ ≤ σ/4, we have ||ξ||_2 ≥ √n μ, so from (C.5) we have

  δL_μ(w + Δ, w) = Δ^T X^T(I − ξξ^T/||ξ||_2^2)XΔ/(2√n||ξ||_2) ≤ ||XΔ||_2^2/(2√n||ξ||_2) ≤ 2ψ_max n||Δ||_2^2/(2√n · σ√n/4) ≤ (8ψ_max/σ)||Δ||_2^2/2.

Next, we verify the RSC property. From (C.7), ||ξ||_2 ≥ √n μ. We want to show that, with high probability, for a constant a ∈ (0, 1),

  (ξ^T XΔ)^2/||ξ||_2^2 ≤ a||XΔ||_2^2.   (C.8)

Consequently,

  Δ^T X^T(I − ξξ^T/||ξ||_2^2)XΔ = ||XΔ||_2^2 − (ξ^T XΔ)^2/||ξ||_2^2 ≥ (1 − a)||XΔ||_2^2,

which further implies

  δL_μ(w + Δ, w) = Δ^T X^T(I − ξξ^T/||ξ||_2^2)XΔ/(2√n||ξ||_2) ≥ (1 − a)ψ_min n||Δ||_2^2/(4√n||ξ||_2).   (C.9)

Since ||ż||_2 ≥ ||z̈||_2, for any constant a ∈ (0, 1),

  P[(ξ^T XΔ)^2 ≥ a||XΔ||_2^2||ξ||_2^2]
   = P[((ε − αż − (1−α)z̈)^T(ż − z̈))^2 ≥ a||ż − z̈||_2^2 ||ε − αż − (1−α)z̈||_2^2]
   ≤(i) P[(ε^T(ż − z̈) − ż^T(ż − z̈))^2 ≥ a||ż − z̈||_2^2 (||ε||_2 − ||ż||_2)^2]
   =(ii) P[|ε^T(ż − z̈)/||ż − z̈||_2 − ż^T(ż − z̈)/||ż − z̈||_2| ≥ √a (||ε||_2 − ||ż||_2)],   (C.10)

where (ii) is from dividing both sides by ||ż − z̈||_2^2, and (i) is from a geometric inspection and the randomness of ε: for any α ∈ [0, 1] and ||ż||_2 ≥ ||z̈||_2,

  ż^T(ż − z̈) ≥ (αż + (1 − α)z̈)^T(ż − z̈) and ż^T(ż − z̈) ≥ 0,

and the random vector ε with i.i.d. entries does not affect the inequality above. Let us first discuss one side of the probability in (C.10), i.e.,

  P[ε^T(ż − z̈)/||ż − z̈||_2 − ż^T(ż − z̈)/||ż − z̈||_2 ≥ √a (||ε||_2 − ||ż||_2)]
   ≤ P[ε^T(ż − z̈)/||ż − z̈||_2 − √a||ε||_2 ≥ (√a − 1)(||ż||_2 − ε^T ż/||ż||_2)].   (C.11)

Since ε has i.i.d. sub-Gaussian entries with E[ε_i] = 0 and E[ε_i^2] = σ^2 for all i = 1, ..., n, the random variables ε^T(ż − z̈)/||ż − z̈||_2 and ε^T ż are also zero-mean sub-Gaussian with variance proxies σ^2 and σ^2||ż||_2^2 respectively. We have from Wainwright (2015) that

  P[||ε||_2 ≤ (1 − δ)σ√n] ≤ exp(−c₄nδ²),   (C.12)
  P[|ε^T(ż − z̈)|/||ż − z̈||_2 ≥ σ√n δ] ≤ 2exp(−c₅nδ²),   (C.13)
  P[|ε^T ż| ≥ σ||ż||_2 √n δ] ≤ 2exp(−c₆nδ²).   (C.14)

Combining (C.12)–(C.14) with ||ż||_2 ≤ σ√n/4, we have from the union bound that, with probability at least 1 − exp(−c₄n) − 2exp(−c₅n) − 2exp(−c₆n),

  ||ε||_2 ≥ (2/3)σ√n, |ε^T(ż − z̈)|/||ż − z̈||_2 ≤ σ√n/64, and |ε^T ż| ≤ σ²n/16.

This implies that, for a ≥ 3/5, we have (ξ^T XΔ)^2 ≤ a||XΔ||_2^2||ξ||_2^2. For the other side of the probability in (C.10), we have

  P[−ε^T(ż − z̈)/||ż − z̈||_2 + ż^T(ż − z̈)/||ż − z̈||_2 ≥ √a (||ε||_2 − ||ż||_2)]
   ≤ P[−ε^T(ż − z̈)/||ż − z̈||_2 − √a||ε||_2 ≥ (√a − 1)(||ż||_2 − ε^T ż/||ż||_2)],   (C.15)

which can be bounded by the same argument.

24 Combiig C.0, C. ad C.5, we have C.8 holds with high probability, i.e., for ay r > 0, [ ξ P X ξ ] a X 6 exp. 44 Now wo boud ξ to obtai the desired result. From Waiwright 05, we have P [ ɛ σ + δ ] exp δ = exp, C where we take δ = /. From ξ = ɛ αż α z, we have i ii ξ ɛ + α ż + α z ɛ + z 3 σ + σ. C.7 where i is from ż z ad ii is from C.6 ad ż σ /4. The by the uio boud settig a = /, with probability at least 7 exp 44, we have δl µ w +, w ψ mi 8σ. Moreover, we also have r = σ 8ψ max > s 64σλ N /ψ mi s 8λ N /ρ s + s for large eough c s log d, where λ N λ N = 48 log d/. The choice of the costat i λ N λ N is somewhat arbitrary, which ca be ay fixed costat larger tha /η such that the existece of λ N is guarateed. D Proof of Theorem 3.7 Part. We first show that estimatio errors are as claimed. Sice θ is the approximate solutio of K-th path followig stage, it satisfies ω λk θ λ K /4 λ K+ / for t [N +, T ], the we have from Lemma B.3 that ω λk+ θ 0 [K+] λ K+/. By Theorem B., we have for ay t =,,..., Applyig Lemma A. recursively, we have θ [N] θ λ N s θ t [K+] S 0 s. ρ s + s ad θ [N] θ λ Ns. ρ s + s Applyig Lemma 3.6 with λ N = 4 log d/ ad ρ s + s = ψ mi 8σ, the by uio boud, with probability at least 8 exp 44 d, we have θ [N] θ 384σ s log d/ ψ mi, θ [N] θ 304σs log d/ ψ mi. 4

Part 2. Next, we demonstrate the result for the estimator of the variance. Let θ̄^[N] = argmin_θ F_{μ,λ_N}(θ) be the optimal solution of the N-th stage. Applying the argument of Part 1 recursively, we have

  ||θ̄^[N] − θ*||_1 ≤ 304σ s*√(log d/n)/ψ_min.   (D.1)

Denote c₁, c₂, ... as positive universal constants. Then we have

  L_μ(θ̄^[N]) − L_μ(θ*) ≤(i) λ_N(||θ*||_1 − ||θ̄^[N]||_1) ≤ λ_N(||θ*_S − θ̄^[N]_S||_1 − ||θ̄^[N]_{S̄}||_1) ≤ λ_N||θ̄^[N] − θ*||_1 ≤(ii) c₁σ s* log d/n,   (D.2)

where (i) is from the optimality of θ̄^[N] and (ii) is from the value of λ_N and the ℓ1 error bound in (D.1). On the other hand, from the convexity of L_μ(θ), we have

  L_μ(θ̄^[N]) − L_μ(θ*) ≥ ∇L_μ(θ*)^T(θ̄^[N] − θ*) ≥ −||∇L_μ(θ*)||_∞||θ̄^[N] − θ*||_1 ≥(i) −λ_N||θ̄^[N] − θ*||_1/6 ≥(ii) −c₃σ s* log d/n,   (D.3)

where (i) is from Assumption 3.2 and (ii) is from the value of λ_N and the ℓ1 error bound in (D.1). For our choice of μ and n, we have L_μ(θ̄^[N]) = (1/√n)||y − Xθ̄^[N]||_2 − μ/2 by Proposition 3.8; hence

  L_μ(θ̄^[N]) − L_μ(θ*) = (1/√n)(||y − Xθ̄^[N]||_2 − ||ε||_2).   (D.4)

From Wainwright (2015), we have, for any δ > 0,

  P[| ||ε||_2/√n − σ | ≥ σδ] ≤ 2exp(−nδ²/8).   (D.5)

Combining (D.2), (D.3), (D.4) and (D.5) with δ = c₃ s* log d/n, we have, with high probability,

  |(1/√n)||y − Xθ̄^[N]||_2 − σ| = O(σ s* log d/n).   (D.6)

From Part 1, for n ≥ c₄ s* log d, we have with high probability

  ||θ̂^[N] − θ*||_2 ≤ 384σ√(s* log d/n)/ψ_min ≤ σ/(8√ψ_max),

so that θ̂^[N] ∈ B_r^{s*+s̃} and ||θ̂^[N] − θ̄^[N]||_0 ≤ s* + s̃. Then, from the analysis of Theorem B.2, we have

  ω_{λ_N}(θ̂^[N]) ≤ (1 + κ_{s*+s̃})√(4ρ⁺_{s*+s̃}(F_{μ,λ_N}(θ̂^[N]) − F_{μ,λ_N}(θ̄^[N]))) ≤ ε_N·(1 + κ_{s*+s̃}),

which implies

  F_{μ,λ_N}(θ̂^[N]) − F_{μ,λ_N}(θ̄^[N]) ≤ ε_N²(1 + κ_{s*+s̃})/(4ρ⁺_{s*+s̃}).   (D.7)

On the other hand, from the LRSC property of L_μ, the convexity of the ℓ1 norm, and the optimality of θ̄^[N], we have

  F_{μ,λ_N}(θ̂^[N]) − F_{μ,λ_N}(θ̄^[N]) ≥ (ρ⁻_{s*+s̃}/2)||θ̂^[N] − θ̄^[N]||_2^2.   (D.8)

Combining (D.7), (D.8) and Assumption 3.5, we have

  (1/√n)||X(θ̂^[N] − θ̄^[N])||_2 ≤ 4ε_N(1 + κ_{s*+s̃})/ψ_min.   (D.9)

Combining (D.6) and (D.9), we have

  (1/√n)||y − Xθ̂^[N]||_2 ≤ (1/√n)||y − Xθ̄^[N]||_2 + (1/√n)||X(θ̂^[N] − θ̄^[N])||_2 ≤ (1/√n)||y − Xθ̄^[N]||_2 + 4ε_N(1 + κ_{s*+s̃})/ψ_min.

If ε_N ≤ c₅σ s* log d/n for some constant c₅, then we obtain the desired result.

E Proof of Proposition 3.8

Let θ̄ and θ̃ be the unique global optima of (1.3) and (1.4) respectively, for λ = λ_K with K = Ñ+1, ..., N. We show that θ̄ = θ̃ under the proposed conditions. Applying the argument of the proof of Theorem 3.7 recursively, we have, with probability at least 1 − 8exp(−c₆n) − 2/d,

  ||θ̃ − θ*||_2 ≤ 384σ√(s* log d/n)/ψ_min.

By the RE condition on X in Assumption 3.5, this implies

  (1/√n)||X(θ̃ − θ*)||_2 ≤ 384σ√(ψ_max)√(s* log d/n)/ψ_min.   (E.1)

On the other hand, we have

  (1/√n)||y − Xθ̃||_2 = (1/√n)||X(θ* − θ̃) + ε||_2 ≥ (1/√n)||ε||_2 − (1/√n)||X(θ̃ − θ*)||_2.   (E.2)

Since ε has i.i.d. sub-Gaussian entries with E[ε_i] = 0 and E[ε_i^2] = σ^2 for all i = 1, ..., n, we have from Wainwright (2015) that

  P[||ε||_2 ≤ 3σ√n/4] ≤ exp(−c₇n).   (E.3)

Combining (E.1), (E.2), (E.3) and the condition n ≥ c₄ s* log d for some constant c₄, we have, with probability at least 1 − 9exp(−c₈n) − 2/d,

  (1/√n)||y − Xθ̃||_2 ≥ 3σ/4 − 384σ√(ψ_max)√(s* log d/n)/ψ_min ≥ σ/4 > μ.

This implies F_{μ,λ}(θ̃) = F_λ(θ̃) − μ/2, where F_λ denotes the unsmoothed objective in (1.3); thus argmin_θ F_{μ,λ}(θ) = argmin_θ F_λ(θ), i.e., θ̃ = θ̄. Besides, this also implies (1/√n)||y − Xθ̃||_2 > σ/4 ≥ μ, i.e., θ̃ is not in the smoothed region. Applying the same argument again to θ̄, we have that, for large enough n, with high probability,

  (1/√n)||y − Xθ̄||_2 > σ/4 ≥ μ,

and r = σ/(8√ψ_max) > 64σλ_Ñ√(s*)/ψ_min ≥ 8√(s*)λ_Ñ/ρ⁻_{s*+s̃} is guaranteed, where λ_Ñ = 48√(log d/n) ≥ λ_N. By Lemma B.4, this implies the existence of the linear convergence region, which does not fall into the smoothed region. Besides, θ* is not in the smoothed region either.

F Proof of Theorem B.1

The sublinear rate of convergence (B.1) follows directly from Lemma A.7. In terms of the optimal residual, we have

  ω²_{λ_K}(θ^(t+1)) ≤(i) (L_t + S_{L_t}(θ^(t)))²||θ^(t+1) − θ^(t)||_2^2 ≤(ii) (L̃_t + ||X||_2²/μ)²||θ^(t+1) − θ^(t)||_2^2,

and, averaging over iterations i = m, ..., k,

  min_{i ∈ [m,k]} ||θ^(i+1) − θ^(i)||_2^2 ≤ (1/(k − m + 1)) Σ_{i=m}^{k} ||θ^(i+1) − θ^(i)||_2^2
   ≤(iv) (2/((k − m + 1) min_{i ∈ [m,k]} L_i)) (F_{μ,λ_K}(θ^(m)) − F_{μ,λ_K}(θ̄^[K])),

so that, using (A.16) and Remark A.1 for (ii)–(iv), Lemma A.7 for (v), and choosing m = k/2 for (vi),

  ω²_{λ_K}(θ^(k+1)) ≤(v) 72R²||X||_2^6 L_μ/((k − m + 1)(m + 1)μ³) ≤(vi) 288R²||X||_2^6 L_μ/((k + 3)²μ³).   (F.1)

To achieve the approximate KKT condition ω_{λ_K}(θ^(t)) ≤ ε_K, we require the right-hand side of (F.1) to be no greater than ε_K², which gives the desired result (B.2).

G Proof of Theorem B.2

Note that the RSS property implies that the line search terminates with a step size parameter L_t satisfying

  ρ⁺_{s*+s̃}/2 ≤ L_t ≤ 2ρ⁺_{s*+s̃}.   (G.1)

Since the initialization θ^(0) satisfies ω_{λ_K}(θ^(0)) ≤ λ_K/2 with ||θ^(0)_{S̄}||_0 ≤ s̃, by Lemma A.2 the objective satisfies

  F_{μ,λ_K}(θ^(0)) − F_{μ,λ_K}(θ*) ≤ 6λ_K²s*/ρ⁻_{s*+s̃}.

Then by Lemma A.4, we have ||θ^(1)_{S̄}||_0 ≤ s̃. By the monotone decrease of F_{μ,λ_K}(θ^(t)) from (A.16) in Lemma A.8 and by recursively applying Lemma A.4, ||θ^(t)_{S̄}||_0 ≤ s̃ holds in (B.3) for any t = 1, 2, .... For the objective error, we have

  F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K]) ≤(i) (1 − 1/(8κ_{s*+s̃}))^t (F_{μ,λ_K}(θ^(0)) − F_{μ,λ_K}(θ̄^[K])) ≤(ii) (1 − 1/(8κ_{s*+s̃}))^t · 4λ_K√(s*) ω_{λ_K}(θ^(0))/ρ⁻_{s*+s̃},   (G.2)

where (i) is from Lemma A.6, and (ii) is from Lemma A.5 with λ = λ̄ = λ_K and ω_{λ_K}(θ^(0)) ≤ λ_K/2 ≤ λ_K, which yields (B.3). Combining (G.2) and (A.4), we have

  ||θ^(t) − θ̄^[K]||_2^2 ≤ (2/ρ⁻_{s*+s̃})(F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K])) ≤ (1 − 1/(8κ_{s*+s̃}))^t · 8λ_K√(s*) ω_{λ_K}(θ^(0))/(ρ⁻_{s*+s̃})².

For the optimal residual ω_{λ_K}(θ^(t+1)) at the (t+1)-th iteration of the K-th path following stage, we have

  ω_{λ_K}(θ^(t+1)) ≤(i) (L_t + S_{L_t}(θ^(t)))||θ^(t+1) − θ^(t)||_2 ≤(ii) (L_t + ρ⁺_{s*+s̃})||θ^(t+1) − θ^(t)||_2
   ≤(iii),(iv) (L_t + ρ⁺_{s*+s̃})√(2(F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ^(t+1)))/L_t)
   ≤(v) (1 + κ_{s*+s̃})√(4ρ⁺_{s*+s̃}(F_{μ,λ_K}(θ^(t)) − F_{μ,λ_K}(θ̄^[K])))
   ≤(vi) (1 + κ_{s*+s̃}) (1 − 1/(8κ_{s*+s̃}))^{t/2} √(96λ_K²√(s*) κ_{s*+s̃}),   (G.3)

where (i) is from Lemma A.9, (ii) is from S_{L_t}(θ^(t)) ≤ ρ⁺_{s*+s̃}, (iii) is from ρ⁺_{s*+s̃}/2 ≤ L_t in (G.1), (iv) is from (A.16) in Lemma A.8, (v) is from L_t ≤ 2ρ⁺_{s*+s̃} in (G.1) and the monotone decrease of F_{μ,λ_K}(θ^(t)) from (A.16) in Lemma A.8, and (vi) is from (G.2) and κ_{s*+s̃} = ρ⁺_{s*+s̃}/ρ⁻_{s*+s̃}.

For the K-th path following stage, K = Ñ, ..., N−1, to have ω_{λ_K}(θ^(t+1)) ≤ λ_K/4, we set the right-hand side of (G.3) to be no greater than λ_K/4, which requires the number of iterations to be at least the upper bound in (B.4). For the last (N-th) path following stage, we need ω_{λ_N}(θ̂^[N]) ≤ ε_N ≤ λ_N/4; setting the right-hand side of (G.3) to be no greater than ε_N requires the number of iterations to be at least the upper bound in (B.5).

H Proof of Lemma B.3

Since ω_{λ_{K−1}}(θ̂^[K−1]) ≤ λ_{K−1}/4, there exists some subgradient g ∈ ∂||θ̂^[K−1]||_1 such that

  ||∇L_μ(θ̂^[K−1]) + λ_{K−1}g||_∞ ≤ λ_{K−1}/4.   (H.1)

By the definition of ω_{λ_K}, we have

  ω_{λ_K}(θ̂^[K−1]) ≤ ||∇L_μ(θ̂^[K−1]) + λ_K g||_∞ = ||∇L_μ(θ̂^[K−1]) + λ_{K−1}g + (λ_K − λ_{K−1})g||_∞
   ≤ ||∇L_μ(θ̂^[K−1]) + λ_{K−1}g||_∞ + (λ_{K−1} − λ_K)||g||_∞ ≤(i) λ_{K−1}/4 + (1 − η)λ_{K−1} ≤(ii) λ_K/2,

where (i) is from (H.1) and the choice λ_K = ηλ_{K−1}, and (ii) is from the condition on η.


More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

On forward improvement iteration for stopping problems

On forward improvement iteration for stopping problems O forward improvemet iteratio for stoppig problems Mathematical Istitute, Uiversity of Kiel, Ludewig-Mey-Str. 4, D-24098 Kiel, Germay irle@math.ui-iel.de Albrecht Irle Abstract. We cosider the optimal

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

arxiv: v1 [math.pr] 13 Oct 2011

arxiv: v1 [math.pr] 13 Oct 2011 A tail iequality for quadratic forms of subgaussia radom vectors Daiel Hsu, Sham M. Kakade,, ad Tog Zhag 3 arxiv:0.84v math.pr] 3 Oct 0 Microsoft Research New Eglad Departmet of Statistics, Wharto School,

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Graphical Nonconvex Optimization via an Adaptive Convex Relaxation

Graphical Nonconvex Optimization via an Adaptive Convex Relaxation Graphical Nocovex Optimizatio via a Adaptive Covex Relaxatio Qiag Su 1 Kea Mig Ta Ha Liu 3 Tog Zhag 3 Abstract We cosider the problem of learig highdimesioal Gaussia graphical models. The graphical lasso

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

APPENDIX A SMO ALGORITHM

APPENDIX A SMO ALGORITHM AENDIX A SMO ALGORITHM Sequetial Miimal Optimizatio SMO) is a simple algorithm that ca quickly solve the SVM Q problem without ay extra matrix storage ad without usig time-cosumig umerical Q optimizatio

More information

A survey on penalized empirical risk minimization Sara A. van de Geer

A survey on penalized empirical risk minimization Sara A. van de Geer A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

Regularity conditions

Regularity conditions Statistica Siica: Supplemet MEASUREMENT ERROR IN LASSO: IMPACT AND LIKELIHOOD BIAS CORRECTION Øystei Sørese, Aroldo Frigessi ad Mage Thorese Uiversity of Oslo Supplemetary Material This ote cotais proofs

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

On cross-validated Lasso

On cross-validated Lasso O cross-validated Lasso Deis Chetverikov Zhipeg Liao The Istitute for Fiscal Studies Departmet of Ecoomics, UCL cemmap workig paper CWP47/6 O Cross-Validated Lasso Deis Chetverikov Zhipeg Liao Abstract

More information

Estimation of Backward Perturbation Bounds For Linear Least Squares Problem

Estimation of Backward Perturbation Bounds For Linear Least Squares Problem dvaced Sciece ad Techology Letters Vol.53 (ITS 4), pp.47-476 http://dx.doi.org/.457/astl.4.53.96 Estimatio of Bacward Perturbatio Bouds For Liear Least Squares Problem Xixiu Li School of Natural Scieces,

More information

On Cross-Validated Lasso

On Cross-Validated Lasso O Cross-Validated Lasso Deis Chetverikov Zhipeg Liao Abstract I this paper, we derive a rate of covergece of the Lasso estimator whe the pealty parameter λ for the estimator is chose usig K-fold cross-validatio;

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data Doubly Stochastic Primal-Dual Coordiate Method for Regularized Empirical Risk Miimizatio with Factorized Data Adams Wei Yu, Qihag Li, Tiabao Yag Caregie Mello Uiversity The Uiversity of Iowa weiyu@cs.cmu.edu,

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Vector Quantization: a Limiting Case of EM

Vector Quantization: a Limiting Case of EM . Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z

More information

Lecture 11: Decision Trees

Lecture 11: Decision Trees ECE9 Sprig 7 Statistical Learig Theory Istructor: R. Nowak Lecture : Decisio Trees Miimum Complexity Pealized Fuctio Recall the basic results of the last lectures: let X ad Y deote the iput ad output spaces

More information

Variable selection in principal components analysis of qualitative data using the accelerated ALS algorithm

Variable selection in principal components analysis of qualitative data using the accelerated ALS algorithm Variable selectio i pricipal compoets aalysis of qualitative data usig the accelerated ALS algorithm Masahiro Kuroda Yuichi Mori Masaya Iizuka Michio Sakakihara (Okayama Uiversity of Sciece) (Okayama Uiversity

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

Differentiable Convex Functions

Differentiable Convex Functions Differetiable Covex Fuctios The followig picture motivates Theorem 11. f ( x) f ( x) f '( x)( x x) ˆx x 1 Theorem 11 : Let f : R R be differetiable. The, f is covex o the covex set C R if, ad oly if for

More information

MODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING. University of Illinois at Urbana-Champaign

MODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING. University of Illinois at Urbana-Champaign MODEL CHANGE DETECTION WITH APPLICATION TO MACHINE LEARNING Yuheg Bu Jiaxu Lu Veugopal V. Veeravalli Uiversity of Illiois at Urbaa-Champaig Tsighua Uiversity Email: bu3@illiois.edu, lujx4@mails.tsighua.edu.c,

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Research Article The Oracle Inequalities on Simultaneous Lasso and Dantzig Selector in High-Dimensional Nonparametric Regression

Research Article The Oracle Inequalities on Simultaneous Lasso and Dantzig Selector in High-Dimensional Nonparametric Regression athematical Problems i Egieerig Volume 3, Article ID 5736, 6 pages http://dx.doi.org/.55/3/5736 Research Article The Oracle Iequalities o Simultaeous Lasso ad Datzig Selector i High-Dimesioal Noparametric

More information

Expectation-Maximization Algorithm.

Expectation-Maximization Algorithm. Expectatio-Maximizatio Algorithm. Petr Pošík Czech Techical Uiversity i Prague Faculty of Electrical Egieerig Dept. of Cyberetics MLE 2 Likelihood.........................................................................................................

More information

Machine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring

Machine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring Machie Learig Regressio I Hamid R. Rabiee [Slides are based o Bishop Book] Sprig 015 http://ce.sharif.edu/courses/93-94//ce717-1 Liear Regressio Liear regressio: ivolves a respose variable ad a sigle predictor

More information

Estimation Error of the Constrained Lasso

Estimation Error of the Constrained Lasso Estimatio Error of the Costraied Lasso Nissim Zerbib 1, Ye-Hua Li, Ya-Pig Hsieh, ad Volka Cevher Abstract This paper presets a o-asymptotic upper boud for the estimatio error of the costraied lasso, uder

More information

Mathematical Modeling of Optimum 3 Step Stress Accelerated Life Testing for Generalized Pareto Distribution

Mathematical Modeling of Optimum 3 Step Stress Accelerated Life Testing for Generalized Pareto Distribution America Joural of Theoretical ad Applied Statistics 05; 4(: 6-69 Published olie May 8, 05 (http://www.sciecepublishiggroup.com/j/ajtas doi: 0.648/j.ajtas.05040. ISSN: 6-8999 (Prit; ISSN: 6-9006 (Olie Mathematical

More information

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic

More information

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates Iteratioal Joural of Scieces: Basic ad Applied Research (IJSBAR) ISSN 2307-4531 (Prit & Olie) http://gssrr.org/idex.php?joural=jouralofbasicadapplied ---------------------------------------------------------------------------------------------------------------------------

More information

Optimization Methods MIT 2.098/6.255/ Final exam

Optimization Methods MIT 2.098/6.255/ Final exam Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short

More information

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology Advaced Aalysis Mi Ya Departmet of Mathematics Hog Kog Uiversity of Sciece ad Techology September 3, 009 Cotets Limit ad Cotiuity 7 Limit of Sequece 8 Defiitio 8 Property 3 3 Ifiity ad Ifiitesimal 8 4

More information

Homework #2. CSE 546: Machine Learning Prof. Kevin Jamieson Due: 11/2 11:59 PM

Homework #2. CSE 546: Machine Learning Prof. Kevin Jamieson Due: 11/2 11:59 PM Homework #2 CSE 546: Machie Learig Prof. Kevi Jamieso Due: 11/2 11:59 PM 1 A Taste of Learig Theory 1. [5 poits] For i = 1,..., fix x i R d ad let y i = x T i w i.i.d. + ɛ i where ɛ i N (0, σ 2 ). All

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

Homework Set #3 - Solutions

Homework Set #3 - Solutions EE 15 - Applicatios of Covex Optimizatio i Sigal Processig ad Commuicatios Dr. Adre Tkaceko JPL Third Term 11-1 Homework Set #3 - Solutios 1. a) Note that x is closer to x tha to x l i the Euclidea orm

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Precise Rates in Complete Moment Convergence for Negatively Associated Sequences

Precise Rates in Complete Moment Convergence for Negatively Associated Sequences Commuicatios of the Korea Statistical Society 29, Vol. 16, No. 5, 841 849 Precise Rates i Complete Momet Covergece for Negatively Associated Sequeces Dae-Hee Ryu 1,a a Departmet of Computer Sciece, ChugWoo

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION DOI: 10.1038/NPHYS309 O the reality of the quatum state Matthew F. Pusey, 1, Joatha Barrett, ad Terry Rudolph 1 1 Departmet of Physics, Imperial College Lodo, Price Cosort Road, Lodo SW7 AZ, Uited Kigdom

More information

The L[subscript 1] penalized LAD estimator for high dimensional linear regression

The L[subscript 1] penalized LAD estimator for high dimensional linear regression The L[subscript 1] pealized LAD estimator for high dimesioal liear regressio The MIT Faculty has made this article opely available. Please share how this access beefits you. Your story matters. Citatio

More information

Comparison of Minimum Initial Capital with Investment and Non-investment Discrete Time Surplus Processes

Comparison of Minimum Initial Capital with Investment and Non-investment Discrete Time Surplus Processes The 22 d Aual Meetig i Mathematics (AMM 207) Departmet of Mathematics, Faculty of Sciece Chiag Mai Uiversity, Chiag Mai, Thailad Compariso of Miimum Iitial Capital with Ivestmet ad -ivestmet Discrete Time

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Feedback in Iterative Algorithms

Feedback in Iterative Algorithms Feedback i Iterative Algorithms Charles Byre (Charles Byre@uml.edu), Departmet of Mathematical Scieces, Uiversity of Massachusetts Lowell, Lowell, MA 01854 October 17, 2005 Abstract Whe the oegative system

More information

Technical Proofs for Homogeneity Pursuit

Technical Proofs for Homogeneity Pursuit Techical Proofs for Homogeeity Pursuit bstract This is the supplemetal material for the article Homogeeity Pursuit, submitted for publicatio i Joural of the merica Statistical ssociatio. B Proofs B. Proof

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

ECE-S352 Introduction to Digital Signal Processing Lecture 3A Direct Solution of Difference Equations

ECE-S352 Introduction to Digital Signal Processing Lecture 3A Direct Solution of Difference Equations ECE-S352 Itroductio to Digital Sigal Processig Lecture 3A Direct Solutio of Differece Equatios Discrete Time Systems Described by Differece Equatios Uit impulse (sample) respose h() of a DT system allows

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

CMSE 820: Math. Foundations of Data Sci.

CMSE 820: Math. Foundations of Data Sci. Lecture 17 8.4 Weighted path graphs Take from [10, Lecture 3] As alluded to at the ed of the previous sectio, we ow aalyze weighted path graphs. To that ed, we prove the followig: Theorem 6 (Fiedler).

More information

Bayesian Methods: Introduction to Multi-parameter Models

Bayesian Methods: Introduction to Multi-parameter Models Bayesia Methods: Itroductio to Multi-parameter Models Parameter: θ = ( θ, θ) Give Likelihood p(y θ) ad prior p(θ ), the posterior p proportioal to p(y θ) x p(θ ) Margial posterior ( θ, θ y) is Iterested

More information

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5 CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio

More information