Randomized block proximal damped Newton method for composite self-concordant minimization

Size: px
Start display at page:

Download "Randomized block proximal damped Newton method for composite self-concordant minimization"

Transcription

1 Randomzed block proxmal damped Newton method for composte self-concordant mnmzaton Zhaosong Lu June 30, 2016 Revsed: March 28, 2017 Abstract In ths paper we consder the composte self-concordant CSC mnmzaton problem, whch mnmzes the sum of a self-concordant functon f and a possbly nonsmooth proper closed convex functon g. The CSC mnmzaton s the cornerstone of the path-followng nteror pont methods for solvng a broad class of convex optmzaton problems. It has also found numerous applcatons n machne learnng. The proxmal damped Newton PDN methods have been well studed n the lterature for solvng ths problem that enjoy a nce teraton complexty. Gven that at each teraton these methods typcally requre evaluatng or accessng the Hessan of f and also need to solve a proxmal Newton subproblem, the cost per teraton can be prohbtvely hgh when appled to large-scale problems. Inspred by the recent success of block coordnate descent methods, we propose a randomzed block proxmal damped Newton RBPDN method for solvng the CSC mnmzaton. Compared to the PDN methods, the computatonal cost per teraton of RBPDN s usually sgnfcantly lower. The computatonal experment on a class of regularzed logstc regresson problems demonstrate that RBPDN s ndeed promsng n solvng large-scale CSC mnmzaton problems. The convergence of RBPDN s also analyzed n the paper. In partcular, we show that RBPDN s globally convergent when g s Lpschtz contnuous. It s also shown that RBPDN enjoys a local lnear convergence. Moreover, we establsh a global lnear rate of convergence for a class of g ncludng the case where g s smooth but not necessarly self-concordant and g s Lpschtz contnuous n a certan level set of f + g. As a consequence, we obtan a global lnear rate of convergence for the classcal damped Newton methods [24, 42] and the PDN [33] for such g, whch was prevously unknown n the lterature. Moreover, ths result can be used to sharpen the exstng teraton complexty of these methods. Keywords: Composte self-concordant mnmzaton, damped Newton method, proxmal damped Newton method, randomzed block proxmal damped Newton method. AMS subject classfcatons: 49M15, 65K05, 90C06, 90C25, 90C51 1 Introducton In ths paper we are nterested n the composte self-concordant mnmzaton: F = mn x {F x := fx + gx}, 1.1 Department of Mathematcs, Smon Fraser Unversty, Canada Emal: zhaosong@sfu.ca. supported n part by NSERC Dscovery Grant. Ths author was 1

2 where f : R N R := R { } s a self-concordant functon wth parameter M f 0 and g : R N R s a possbly nonsmooth proper closed convex functon. Specfcally, by the standard defnton of a self-concordant functon e.g., see [27, 24], f s convex and three tmes contnuously dfferentable n ts doman denoted by domf, and moreover, ψ 0 M f ψ 0 3/2 holds for every x domf and u R N, where ψt = fx + tu for any t R. In addton, f s called a standard self-concordant functon f M f = 2. It s well-known that problem 1.1 wth g = 0 s the cornerstone of the path-followng nteror pont methods for solvng a broad class of convex optmzaton problems. Indeed, n the semnal work by Nesterov and Nemrovsk [27], many convex optmzaton problems can be recast nto the problem: mn c, x, 1.2 x Ω where c R N, Ω R N s a closed convex set equpped wth a self-concordant barrer functon B, and, denotes the standard nner product. It has been shown that an approxmate soluton of problem 1.2 can be found by solvng approxmately a sequence of barrer problems: mn x {f t x := c, x + tbx}, where t > 0 s updated wth a sutable scheme. Clearly, these barrer problems are a specal case of 1.1 wth f = f t and g = 0. Recently, Tran-Dnh et al. [32] extended the aforementoned path-followng scheme to solve the problem mn x Ω gx, where g and Ω are defned as above. They showed that an approxmate soluton of ths problem can be obtaned by solvng approxmately a sequence of composte barrer problems: mn tbx + gx, x where t > 0 s sutably updated. These problems are also a specal case of 1.1 wth f = tb. In addton, numerous models n machne learnng are also a specal case of 1.1. For example, n the context of supervsed learnng, each sample s recorded as w, y, where w R N s a sample feature vector and y R s usually a target response or a bnary +1 or -1 label. A loss functon φx; w, y s typcally assocated wth each w, y. Some popular loss functons nclude, but are not lmted to: squared loss: φx; w, y = y w, x 2 ; logstc loss: φx; w, y = log1 + exp y w, x. A lnear predctor s often estmated by solvng the emprcal rsk mnmzaton model: 1 m mn φx; w, y + µ x m 2 x 2 +gx, }{{} fx where m s the sample sze and g s a regularzer such as l 1 norm. For stablty purpose, the regularzaton term µ x 2 /2, where µ > 0 and s the Eucldean norm, s often ncluded to make the model strongly convex e.g., see [42, 43]. It s easy to observe that when φ s the squared 2

3 loss, the assocated f s self-concordant wth parameter M f = 0. In addton, when φ s the logstc loss, y { 1, 1} for all and µ > 0, Zhang and Xao [42, 43] showed that the assocated f s self-concordant wth parameter M f = R/ µ, where R = max w. Besdes, they proved that the assocated f for a general class of loss functons φ s self-concordant, whch ncludes a smoothed hnge loss. As another example, the graphcal model s often used n statstcs to estmate the condtonal ndependence of a set of random varables e.g., see [41, 6, 9, 19], whch s n the form of: mn S, X log detx + ρ X j, X S++ N j where ρ > 0, S s a sample covarance matrx, and S N ++ s the set of N N postve defnte matrces. Gven that log detx s a self-concordant functon n S N ++ e.g., see [24], t s clear to see that the graphcal model s also a specal case of 1.1. When g = 0, problem 1.1 can be solved by a damped Newton DN method or a mxture of DN and Newton methods e.g., see [24, Secton 4.1.5]. To motvate our study, we now brefly revew these methods for solvng 1.1 wth g = 0. In partcular, gven an ntal pont x 0 domf, the DN method updates the terates accordng to x k+1 = x k + dk 1 + λ k, k 0, where d k s the Newton drecton and λ k s the local norm of d k at x k, whch are gven by: d k = 2 fx k 1 fx k, λ k = d k T 2 fx k d k. 1.3 The mxture of DN and Newton frst apples DN and then swtches to the standard Newton method.e., settng the step length to 1 once an terate s suffcently close to the optmal soluton. The dscusson n [24, Secton 4.1.5] has a drect mplcaton that both DN and the mxture of DN and Newton fnd an approxmate soluton x k satsfyng λ k ɛ n at most O F x 0 F + log log ɛ 1 teratons. Ths complexty can be obtaned by consderng two phases of these methods. The frst phase conssts of the teratons executed by DN for generatng a pont lyng n a certan neghborhood of the optmal soluton n whch the local quadratc convergence of DN or the standard Newton method s ensured to occur, whle the second phase conssts of the rest of the teratons. Indeed, O F x 0 F and Olog log ɛ 1 are an estmate of the number of teratons of these two phases, respectvely. Recently, Zhang and Xao [42, 43] proposed an nexact damped Newton IDN method for solvng 1.1 wth g = 0. Ther method s almost dentcal to DN except that the search drecton d k defned n 1.3 s nexactly computed by solvng approxmately the lnear system 2 fx k d = fx k. By controllng sutably the nexactness on d k and consderng the smlar two phases as above, they showed that IDN can fnd an approxmate soluton x k satsfyng F x k F ɛ n at most O F x 0 F + log ɛ 1 teratons. In addton, Tran-Dnh et al. [33] recently proposed a proxmal damped Newton PDN method and a proxmal Newton method for solvng 1.1. These methods are almost the same as the aforementoned DN and the mxture of DN and Newton except that d k s chosen as the followng proxmal Newton drecton: d k = arg mn d { fx k + fx k, d + 1 } 2 d, 2 fx k d + gx k + d

4 It has essentally been shown n [33, Theorems 6, 7] that the PDN and the proxmal Newton method can fnd an approxmate soluton x k satsfyng λ k ɛ n at most O F x 0 F + log log ɛ 1 teratons, where λ k = d k T 2 fx k d k. Ths complexty was derved smlarly as for the DN and the mxture of DN and Newton by consderng the two phases mentoned above. Besdes, proxmal gradent type methods and proxmal Newton type methods have been proposed n the lterature for solvng a class of composte mnmzaton problems n the form of 1.1 e.g., see [1, 25, 8, 3, 12]. At each teraton, proxmal gradent type methods requre the gradent of f whle proxmal Newton type methods need to access the Hessan of f or ts approxmaton. Though the proxmal Newton type methods [3, 12] are applcable to solve 1.1, they typcally requre a lnear search procedure to determne a sutable step length, whch may be expensve for solvng large-scale problems. In ths paper we are only nterested n a lne-search free method for solvng problem 1.1. It s known from [33] that PDN has a better teraton complexty than the accelerated proxmal gradent methods [1, 25]. The cost per teraton of PDN s, however, generally much hgher because t computes the search drecton d k accordng to 1.4 that nvolves 2 fx k. Ths can brng an enormous challenge to PDN for solvng large-scale problems. Inspred by the recent success of block coordnate descent methods, block proxmal gradent methods and block quas-newton type methods e.g., see [2, 5, 7, 11, 13, 14, 17, 18, 21, 22, 26, 28, 29, 30, 31, 34, 36, 37] for solvng large-scale problems, we propose a randomzed block proxmal damped Newton RBPDN method for solvng 1.1 wth n gx = g x, 1.5 where each x denotes a subvector of x wth dmenson N, {x : = 1,..., n} form a partton of the components of x, and each g : R N R s a proper closed convex functon. Brefly speakng, suppose that p 1,..., p n > 0 are a set of probabltes such that p = 1. Gven a current terate x k, we randomly choose ι {1,..., n} wth probablty p ι. The next terate x k+1 s obtaned by settng x k+1 j = x k j for j ι and x k+1 ι = x k ι + d ιx k 1 + λ ι x k, where d ι x k s an approxmate soluton to the subproblem mn d ι { fx k + ι fx k, d ι d ι, 2 ιιfx k, d ι + g ι x k ι + d ι }, 1.6 λ ι x k = d ι x k, 2 ιιfx k d ι x k, and ι fx k and 2 ιιfx k are respectvely the subvector and the submatrx of fx k and 2 fx k correspondng to x ι. In contrast wth the full PDN [33], the cost per teraton of RBPDN can be consderably lower because: only the submatrx 2 ιιfx k rather than the full 2 fx k needs to be accessed and/or evaluated; and the dmenson of subproblem 1.6 s much smaller than that of 1.4 and thus the computatonal cost for solvng 1.6 can also be substantally lower. In addton, compared to the randomzed block accelerated proxmal gradent RBAPG method [7, 17], RBPDN utlzes the entre curvature nformaton n the random subspace.e., 2 ιιfx k whle RBAPG only uses the partal curvature nformaton, partcularly, the extreme egenvalues of 2 ιιfx k. It s thus expected that RBPDN takes less number of teratons than RBAPG for fndng an approxmate soluton of smlar qualty, whch s ndeed demonstrated n our numercal experments. Overall, RBPDN can be much faster than RBAPG, provded that the subproblem 1.6 s effcently solved. The convergence of RBPDN s analyzed n ths paper. In partcular, we show that when g s Lpschtz contnuous n Sx 0 := {x : F x F x 0 }, 1.7 4

5 RBPDN s globally convergent, that s, E[F x k ] F as k. It s also shown that RBPDN enjoys a local lnear convergence. Moreover, we establsh a global lnear rate of convergence for a class of g ncludng the case where g s smooth but not necessarly self-concordant and g s Lpschtz contnuous n Sx 0, 1 that s, for some q 0, 1, E[F x k F ] q k F x 0 F, k 0. Notce that the DN [24] and PDN [33] are a specal case of RBPDN wth n = 1. As a consequence, we obtan a global lnear rate of convergence for the classcal damped Newton methods [24, 42] and the PDN [33] for such g, whch was prevously unknown n the lterature. Moreover, ths result can be used to sharpen the exstng teraton complexty of the frst phase of DN [24], IDN [42], PDN [33], the proxmal Newton method [33] and the mxture of DN and Newton [24]. The rest of ths paper s organzed as follows. In Subsecton 1.1, we present some assumpton, notaton and also some known facts. In Secton 2 we propose a RBPDN method for solvng problem 1.1 n whch g s n the form of 1.5. In Secton 3, we provde some techncal prelmnares. The convergence analyss of RBPDN s gven n Secton 4. Numercal results are presented n Secton 5. Fnally, n the appendx we dscuss how to solve the subproblems of RBPDN. 1.1 Assumpton, notaton and facts Throughout ths paper, we make the followng assumpton for problem 1.1. Assumpton 1 f s a standard self-concordant functon 2 and g s n the form of f s contnuous and postve defnte n the doman of F. Problem 1.1 has a unque optmal soluton x. Let R N denote the Eucldean space of dmenson N that s equpped wth the standard nner product,. For every x R N, let x denote a subvector of x wth dmenson N, where {x : = 1,..., n} form a partcular partton of the components of x. denotes the Eucldean norm of a vector or the spectral norm of a matrx. The local norm and ts dual norm at any x domf are gven by It s easy to see that u x := u, 2 fxu, v x := v, 2 fx 1 v, u, v R N. u, v u x v x, u, v R N. 1.8 For any {1,..., n}, let 2 fx denote the submatrx of 2 fx correspondng to the subvector x. The local norm and ts dual norm of x restrcted to the subspace of x are defned as y x := y, 2 fxy, z x := z, 2 fx 1 z, y, z R N. 1.9 In addton, for any symmetrc postve defnte matrx M, the weghted norm and ts dual norm assocated wth M are defned as u M := u, Mu, v M := v, M 1 v For example, gx = N x α wth α 2. It s not hard to verfy that g s s Lpschtz contnuous n any compact set, and moreover, g s not self-concordant when α > 2. 2 It follows from [24, Corollary 4.1.2] that f f s self-concordant wth parameter M f, then Mf 2 f/4 s a standard self-concordant functon. Therefore, problem 1.1 can be rescaled nto an equvalent problem for whch Assumpton 1 holds. 5

6 It s clear that u, v u M v M The followng two functons have played a crucal role n studyng some propertes of a standard self-concordant functon e.g., see [24]: ωt = t ln1 + t, ω t = t ln1 t It s not hard to observe that ωt 0 for all t > 1 and ω t 0 for every t < 1, and moreover, ω and ω are strctly ncreasng n [0, and [0, 1, respectvely. In addton, they are conjugate of each other, whch mples that for any t 0 and τ [0, 1, ωt = tω t ω ω t, ωt + ω τ τt 1.13 e.g., see [24, Lemma 4.1.4]. It s known from [24, Theorems 4.1.7, 4.1.8] that f satsfes: fy fx + fx, y x + ω y x x, x domf, y; 1.14 fy fx + fx, y x + ω y x x x, y domf, y x x < Randomzed block proxmal damped Newton method In ths secton we propose a randomzed block proxmal damped Newton RBPDN method for solvng problem 1.1 n whch g s n the form of 1.5. RBPDN method for solvng 1.1: Choose x 0 domf, η [0, 1/4], and p > 0 for = 1,..., n such that n p = 1. Set k = 0. 1 Pck ι {1,..., n} randomly wth probablty p ι. 2 Fnd an approxmate soluton d ι x k to the subproblem whch satsfes mn d ι { fx k + ι fx k, d ι d ι, 2 ιιfx k, d ι + g ι x k ι + d ι }, 2.1 v ι ι fx k + 2 ιιfx k d ι x k + g ι x k ι + d ι x k, 2.2 v ι x k ι η d ι x k x k ι 2.3 end for some v ι. 3 Set x k+1 j = x k j for j ι, xk+1 ι = x k ι + d ι x k /1 + λ ι x k, k k + 1 and go to step 1, where λ ι x k = d ι x k, 2 ιιfx k d ι x k. Remark: The constant η controls the nexactness of solvng subproblem 2.1. Clearly, d ι x k s the optmal soluton to 2.1 f η = 0. 6

7 For varous g, the above d ι x k can be effcently found. For example, when g = 0, d ι x k can be computed by conjugate gradent method. For a more general g, a numercal scheme s proposed n the appendx for fndng d ι x k, whch frst fnds a sutable approxmate soluton z of 2.1 and then obtans d ι x k by applyng a proxmal step to 2.1 at z. For many g, such z can be found by numerous methods, and also the proxmal step to 2.1 at z has a closed-form soluton or can be effcently computed. For example, when g = l1, z can be found by the methods n [1, 25, 10, 35, 38, 40, 39, 23, 4, 20] and the proxmal step to 2.1 at z has a closed-form soluton. To verfy 2.3, one has to compute v ι x, whch can be expensve snce 2 ιιfx k 1 s k ι nvolved. Alternatvely, we may replace 2.3 by a relaton that can be cheaply verfed and also ensures 2.3. Indeed, as seen later, the sequence {x k } les n the compact set Sx 0 and 2 fx s postve defnte for all x Sx 0. It follows that σ f := mn λ mn 2 fx 2.4 x Sx 0 s well-defned and postve, where λ mn denotes the mnmal egenvalue of the assocated matrx. One can observe from 1.9 and 2.4 that v ι x k ι = v T ι 2 ιιfx k 1 v ι v ι σf. It follows that f v ι η σ f d ι x k x k ι holds, so does 2.3. Therefore, for a cheaper computaton, one can replace 2.3 by v ι η σ f d ι x k x k ι, provded that σ f s known or can be bounded from below. v The convergence of RBPDN wll be analyzed n Secton 4. In partcular, we show that f g s Lpschtz contnuous n Sx 0, then RBPDN s globally convergent. It s also shown that RBPDN enjoys a local lnear convergence. Moreover, we establsh a global lnear rate of convergence for a class of g ncludng the case where g s smooth but not necessarly self-concordant and g s Lpschtz contnuous n Sx 0. 3 Techncal prelmnares In ths secton we establsh some techncal results that wll be used later to study the convergence of RBPDN. For any x domf, let ˆdx be an nexact proxmal Newton drecton, whch s an approxmate soluton of { mn fx + fx, d + 1 } d 2 d, 2 fxd + gx + d 3.1 satsfyng ˆv x η ˆdx x and ˆv fx + 2 fx ˆdx + gx + ˆdx 3.2 for some ˆv and η [0, 1/4]. The followng theorem provdes some reducton on the objectve value resulted from an nexact proxmal damped Newton step. 7

8 Lemma 3.1 Let x domf and ˆdx be defned above wth η [0, 1/4]. Then F x + ˆd 1 + ˆλ F x 1 2 ωˆλ, where ˆd = ˆdx and ˆλ = ˆdx x. Proof. By the defnton of ˆd and ˆλ, one can observe that It then follows from 1.15 that f x + ˆd 1 + ˆλ ˆd x /1 + ˆλ = ˆλ/1 + ˆλ < 1. In vew of 3.2 and ˆd = ˆdx, there exsts s gx + ˆd such that fx + 1 ˆλ 1 + ˆλ fx, ˆd + ω 1 + ˆλ. 3.3 fx + 2 fx ˆd + ˆv + s = By the convexty of g, one has g x + ˆd gx + ˆd ˆλgx 1 + ˆλ ˆλ 1 + ˆλ ˆλ [gx + s, ˆd ] + ˆλgx s, ˆd = gx ˆλ 1 + ˆλ Summng up 3.3 and 3.5, and usng 3.4, we have F x + ˆd 1 + ˆλ F x + 1 ˆλ 1 + ˆλ fx + s, ˆd + ω 1 + ˆλ = F x + 1 ˆλ 1 + ˆλ 2 fx ˆd ˆv, ˆd + ω 1 + ˆλ F x ˆλ 2 ˆλ ˆλ ˆλ 1 + ˆλ v x + ω 1 + ˆλ, 3.6 where the last relaton s due to the defnton of ˆλ and In addton, observe from 1.12 that ω ˆλ = ˆλ/1 + ˆλ. It follows from ths and 1.13 that ˆλ 2 ˆλ 1 + ˆλ + ω 1 + ˆλ = ˆλω ˆλ + ω ω ˆλ = ωˆλ, whch along wth 3.6, ˆv x η ˆd x and ˆλ = ˆd x mples F x + ˆd 1 + ˆλ F x ωˆλ + ηˆλ ˆλ We clam that for any η [0, 1/4], ηˆλ ˆλ 1 2 ωˆλ

9 Indeed, let φλ = 1 2 ωλ1 + λ ηλ2. In vew of ω λ = λ/1 + λ, 1.12 and η [0, 1/4], one has that for every λ 0, [ ] φ λ = 1 2 [ω λ1 + λ + ωλ] 2ηλ = 1 λ 2 1+λ 1 + λ + λ ln1 + λ 2ηλ = 1 2ηλ 1 2 ln1 + λ 1 2 [λ ln1 + λ] = 1 2ωλ 0. Ths together wth φ0 = 0 mples φλ 0. Thus 3.8 holds as clamed. The concluson of ths lemma then mmedately follows from 3.7 and 3.8. Remark: Some specal cases of Lemma 3.1 are already consdered n the lterature. In partcular, Tran-Dnh et al. establshed a smlar result n [32, Theorem 5] for the case where ˆdx s the exact soluton of problem 3.1, that s, ts assocated ˆv = 0. In addton, Zhang and Xao derved an analogous result for the case g = 0 n [42, Theorem 1]. We next provde some lower and upper bounds on the optmalty gap, whch s an extenson of the result [24, Theorem 4.1.3] for the case where g = 0. Lemma 3.2 Let x domf and λx be defned as Then λx := where the second nequalty s vald only when λx < 1. mn s F x s x. 3.9 ω x x x F x F ω λx, 3.10 Proof. Snce x s the optmal soluton of problem 1.1, we have fx gx. Ths together wth the convexty of g mples gx gx + fx, x x. Also, by 1.14, one has fx fx + fx, x x + ω x x x. Summng up these two nequaltes yelds the frst nequalty of Suppose λx < 1. We now prove the second nequalty of Indeed, by 1.14, one has fy fx + fx, y x + ω y x x, y. By 3.9, there exsts s F x such that s x = λx < 1. Clearly, s fx gx. In vew of ths and the convexty of g, we have gy gx + s fx, y x, y. Summng up these two nequaltes gves F y F x + s, y x + ω y x x, y. It then follows from ths, 1.8 and 1.13 that F = mn F y mn {F x + s, y x + ω y x x }, y y mn {F x s x y x x + ω y x x }, y F x ω s x = F x ω λx, where the last nequalty uses Thus the second nequalty of 3.10 holds. 9

10 For further dscusson, we denote by dx and λx the exact proxmal Newton drecton and ts local norm at x domf, that s, dx := { arg mn fx + fx, d + 1 } d 2 d, 2 fxd + gx + d, 3.11 λx := dx x The followng result provdes an estmate on the reducton of the objectve value resulted from the exact proxmal damped Newton step. Lemma 3.3 Let x domf, dx and λx be defned respectvely n 3.11 and 3.12, and x = x + dx/1 + λx. Then F x F x ω λx, 3.13 F x F ω λx Proof. The relaton 3.13 follows from [33, Theorem 5]. In addton, the relaton 3.14 holds due to 3.13 and F x F. Throughout the remander of the paper, let d x be an approxmate soluton of the problem { mn fx + fx, d + 1 } d 2 d, 2 fx, d + g x + d, 3.15 whch satsfes the followng condtons: for some v and η [0, 1/4]. Defne v fx + 2 fxd x + g x + d x, 3.16 v x η d x x 3.17 dx := d 1 x,..., d n x, v := v 1,..., v n, 3.18 λ x := d x x, = 1,..., n, 3.19 Hx := Dag 2 11fx,..., 2 nnfx, 3.20 where Hx s a block dagonal matrx, whose dagonal blocks are 2 11fx,..., 2 nnfx. It then follows that fx + v + Hxdx gx + dx The followng result bulds some relatonshp between dx Hx and n λ x. Lemma 3.4 Let x domf, dx, λ x and Hx be defned n 3.18, 3.19 and 3.20, respectvely. Then 1 n n λ x dx Hx λ x n Proof. By 1.9, 1.10, 3.18 and 3.20, one has dx Hx = n 2 fx 1 2 d x 2 1 n n 2 fx 1 2 d x = 1 n n λ x, 10

11 n n λ x = 2 fx 1 2 d x n 2 fx 1 2 d x 2 = dx Hx. The followng lemma bulds some relatonshp between dx Hx and dx x. Lemma 3.5 Let x domf, dx, dx and Hx be defned n 3.11, 3.18 and 3.20, respectvely. Then dx Hx dx x 1 + η Hx fx Hx fx 1 2, η dx x 1 + η Hx fx Hx fx 1 2 dx Hx Proof. For convenence, let d = dx, d = dx, H = Hx and H = 2 fx. Then t follows from 3.21 and 3.11 that fx + v + Hd gx + d, fx + H d gx + d. In vew of these and the monotoncty of g, one has d d, v Hd + H d 0, whch together wth 1.10 and 1.11 mples that d 2 H + d 2 H v, d d + d, H + H d Notce that v H d H + d H + d H d H H 1 2 H + H H d H H 1 2 H 1 2 d H Let H = 2 fx. Observe that v H = v x and d H = d x. These and 3.17 yeld v H η d H. In vew of ths and 3.20, one has v H = v H 2 η 2 d 2 H = η d H It follows from ths, 3.25 and 3.26 that d 2 H + d 2 H η d H d H + H 1 2 H 1 2 d H + d H d H H 1 2 H + H H 1 2, η d 2 H η H 1 2 H H H 2 d H d H, 3.28 where the second nequalty uses the relaton Clearly, 3.28 s equvalent to H 1 2 H + H H 1 2 H 1 2 H H 1 2 H η d 2 H + d 2 H 1 + η H 1 2 H H 1 2 H 1 2 d H d H. Ths, along wth d = dx, d = dx, H = Hx, H = 2 fx and d x = d H, yelds 3.23 and The followng results wll be used subsequently to study the convergence of RBPDN. 11

12 Lemma 3.6 Let Sx 0, σ f, dx, dx, λ x and Hx be defned n 1.7, 2.4, 3.11, 3.18, 3.19 and 3.20, respectvely. Then Sx 0 s a nonempty convex compact set. v v where where x x 2L f /σ f dx, x Sx 0, 3.29 L f = F x F ω max x Sx 0 2 fx n c 1 λ x, x Sx 0, 3.31 c 1 = n max 1 η { }. 1 + η Hx fx Hx fx x Sx 0 dx dx 1 η c 1 nσf dx Hx, x Sx η c 1 nσf n λ x, x Sx Proof. Clearly, Sx 0 due to x 0 Sx 0. By 1.7 and the frst nequalty of 3.10, one can observe that Sx 0 { x : ω x x x F x 0 F }. Ths together wth the strct monotoncty of ω n [0, mples that Sx 0 s a bounded set. In addton, we know that F s a closed convex functon. Hence, Sx 0 s closed and convex. By Assumpton 1, we know that 2 f s contnuous and postve defnte n domf. It follows from ths and the compactness of Sx 0 that σ f and L f are well-defned n 2.4 and 3.30 and moreover they are postve. For convenence, let d = dx and H = 2 fx. By the optmalty condton of 1.1 and 3.11, one has fx + H d gx + d, fx gx, whch together wth the monotoncty of g yeld Hence, we have that for all x Sx 0, x + d x, fx H d + fx 0. σ f x x 2 x x, fx fx d, fx fx x x, H d fx fx d + H x x d 2L f x x d, whch mmedately mples In vew of 3.12, 3.22, 3.23 and 3.32, one can observe that λx = dx x c 1 n 12 λ x, x Sx 0,

13 whch, together wth 3.14 and the monotoncty of ω n [0,, mples that 3.31 holds. v One can observe that dx 2 fx 1 2 dx x 1 dx x, x Sx 0, 3.35 σf where the last nequalty s due to 2.4. Ths, 3.24 and 3.32 lead to v The relaton 3.34 follows from 3.22 and Convergence results In ths secton we establsh some convergence results for RBPDN. In partcular, we show n Subsecton 4.1 that f g s Lpschtz contnuous n Sx 0, then RBPDN s globally convergent. In Subsecton 4.2, we show that RBPDN enjoys a local lnear convergence. In Subsecton 4.3, we show that for a class of g ncludng the case where g s smooth but not necessarly self-concordant and g s Lpschtz contnuous n Sx 0, RBPDN enjoys a global lnear convergence. Fnally, n Subsecton 4.4 we specalze ths result to some PDN methods and mprove ther exstng teraton complexty. 4.1 Global convergence In ths subsecton we study the global convergence of RBPDN. To proceed, we frst establsh a certan reducton on the objectve values over every two consecutve teratons. Lemma 4.1 Let {x k } be generated by RBPDN. Then E ι [F x k+1 ] F x k 1 2 ω n p mn λ x k, k 0, 4.1 where λ s defned n 3.19 and p mn := mn 1 n p. 4.2 Proof. Recall that ι {1,..., n} s randomly chosen at teraton k wth probablty p ι. Snce f s a standard self-concordant functon, t s not hard to observe that fx k 1,..., x k ι 1, z, x k ι+1,..., x k n s also a standard self-concordant functon of z. In vew of ths and Lemma 3.1 wth F replaced by F x k 1,..., x k ι 1, z, x k ι+1,..., x k n, one can obtan that F x k+1 F x k 1 2 ωλ ιx k. 4.3 Takng expectaton wth respect to ι and usng the convexty of ω, one has n n E ι [F x k+1 ] F x k 1 2 p ωλ x k F x k 1 2 ω p λ x k F x k 1 2 ω n λ x k, p mn where the last nequalty follows from 4.2 and the monotoncty of ω n [0,. We next establsh global convergence of RBPDN by consderng two cases n = 1 or n > 1 separately as the latter case requres some mld assumpton. For the case n = 1, RBPDN reduces to an nexact PDN method, whch ncludes the exact PDN method [33] as a specal case. Though 13

14 the local convergence of the exact PDN method s well establshed n [33], the study of ts global convergence s rather lmted there. In fact, the authors of [33] only establshed a smlar result as the one n Lemma 4.1 wth n = 1 see [33, Theorem 5], but they dd not establsh the global convergence results such as F x k F or x k x as k. We next establsh such results for RBPDN wth n = 1, namely, the nexact PDN. Theorem 4.1 Let {x k } be the sequence generated by RBPDN wth n = 1. Then lm k xk = x, lm F k xk = F. Proof. Snce {x k } s generated by RBPDN wth n = 1, t follows from 4.1 wth n = 1 that F x k+1 F x k ω λx k /2, k 0, 4.4 where λx = dx x and dx s defned n Summng up these nequaltes and usng the fact F x k F for all k 0, we have 0 k ωλxk 2F x 0 F. Notce from 1.12 that ωt 0 for all t 0 and ωt = 0 f and only f t = 0. These mply that lm k λx k = 0, that s, lm k dx k x k = 0. In vew of x 0 Sx 0 and 4.4, one can observe that {x k } Sx 0. Usng ths, 2.4 and 3.30, we have dx k 2 fx k 1/2 dx k x k σ 1 f dxk x k, whch along wth lm k dx k x k = 0 yelds lm k dx k = 0. In addton, t follows from 2.2, 2.3 and 3.18 wth n = 1 that there exsts some v k such that for all k, v k fx k + 2 fx k dx k + gx k + dx k, v k x k η dxk x k. 4.5 Recall that {x k } Sx 0 domf and lm k dx k x k = 0. These and 1.15 mples that for suffcently large k, fx k + dx k fx k + fx k, dx k + ω dx k x k. By ths relaton, {x k } Sx 0, lm k dx k = 0, lm k dx k x k = 0 and the fact that Sx 0 s a compact set, there exsts some constant C such that fx k + dx k C and fx k C, that s, x k, x k + dx k Ω = {x : fx C} for suffcently large k. It s not hard to observe from Assumpton 1 that Ω s a nonempty convex compact set and fx k fx k + dx k L f dx k for suffcently large k, where L f = max{ 2 fx : x Ω}. Ths and 3.30 mply that for suffcently large k, v k fx k 2 fx k dx k + fx k + dx k v k + L f + L f dx k. By ths relaton and 4.5, there exsts some s k F x k + dx k such that for suffcently large k, s k v k + L f + L f dx k η L f dx k x k + L f + L f dx k, where the last nequalty follows from v k 2 fx k 1/2 v k x k, 3.30 and the second relaton of 4.5. It thus follows from lm k dx k x k = 0 and lm k dx k = 0 that lm k s k = 0. Recall that {x k } Sx 0 and Sx 0 s a compact set. Let x be an arbtrary accumulaton pont of {x k }. Then there exsts a subsequence K such that lm K k x k = x, whch along wth lm k dx k = 0 mples lm K k x k + dx k = x. Ths along wth s k F x k + dx k and lm k s k = 0 yelds 0 F x. Hence, x s an optmal soluton of problem 1.1. It then follows from Assumpton 1 that x = x. Therefore, x k x and F x k F as k. In what follows, we establsh global convergence of RBPDN wth n > 1 under some mld assumpton. 14

15 Theorem 4.2 Let {x k } be the sequence generated by the RBPDN wth n > 1. Assume that g s Lpschtz contnuous n Sx 0. Then lm E[F k xk ] = F. Proof. It follows from 4.1 that E[F x k+1 ] E[F x k ] 1 2 [ω E n [ n E[F x k ] 1 2 p ω mn E p mn ] λ x k λ x k where the last relaton follows from Jensen s nequalty. Hence, we have 0 [ n ] ω p mn E λ x k 2F x 0 F. k Usng ths and a smlar argument as n the proof of Theorem 4.2, we obtan [ n ] lm E λ x k = k In vew of x 0 Sx 0 and 4.3, one can observe that x k Sx 0 for all k 0. Due to the contnuty of f and the compactness of Sx 0, one can observe that f s Lpschtz contnuous n Sx 0. Ths along wth the assumpton of Lpschtz contnuty of g n Sx 0 mples that F s Lpschtz contnuous n Sx 0 wth some Lpschtz constant L F 0. Usng ths, 3.29 and 3.34, we obtan that for all k 0, F x k F + L F x k x F + 2L f L F σ f F + 21 ηl f L F n 3/2 λ x k, c 1 nσ f ], dx k where the last two nequaltes follow from 3.29 and 3.34, respectvely. Ths together wth 4.6 and F x k F mples that the concluson holds. 4.2 Local lnear convergence In ths subsecton we show that RBPDN enjoys a local lnear convergence. Theorem 4.3 Let {x k } be generated by RBPDN. Suppose F x 0 F +ωc 1 /p mn, where c 1 and p mn are defned n 3.32 and 4.2, respectvely. Then [ E[F x k F 12c2 + p 2 ] k mn 1 θ ] 12c 2 + p 2 F x 0 F, k 0, mn where c 2 := θ [ L f σ f p max := max 1 n p, ] 3/2 21 η c 1 n 2 + η p max, 4.7 θ := mn nf 1 n x Sx 0 p 0, 1, λ x and σ f, L f and c 1 are defned respectvely n 2.4, 3.30 and

16 Proof. Let k 0 be arbtrarly chosen. For convenence, let x = x k and x + = x k+1. By the updatng scheme of x k+1, one can observe that x + j = x j for j ι and x + ι = x ι + d ιx 1 + λ ι x, where ι {1,..., n} s randomly chosen wth probablty p ι and d ι x s an approxmate soluton to problem 3.15 that satsfes 3.16 and 3.17 for some v ι and η [0, 1/4]. To prove ths theorem, t suffces to show that E ι [F x + F 12c2 + p 2 mn 1 θ ] 12c 2 + p 2 F x F. 4.9 mn To ths end, we frst clam that θ s well-defned n 4.8 and moreover θ 0, 1. Indeed, gven any {1,..., n}, let y R N be defned as follows: y = x + d x 1 + λ x, y j = x j, j, where λ s defned n By a smlar argument as for 4.3, one has F y F x 1 2 ωλ x. Usng ths, x Sx 0, F y F and the monotoncty of ω 1, we obtan that λ x ω 1 2[F x F y] ω 1 2[F x 0 F ], where ω 1 s the nverse functon of ω when restrcted to the nterval [0,. 3 It thus follows that θ s well-defned n 4.8 and moreover θ 0, 1. For convenence, let λ = λ x, d = d x and H = 2 fx for = 1,..., n and H = DagH 1,..., H n. In vew of x Sx 0 and 3.30, one can observe that whch along wth 3.29 and 3.33 mples H 2 fx L f, x x H H 1/2 x x 2L 3/2 f /σ f dx, 3/2 L f 1 η 2 d H c 1 n It follows from 3.16 that there exsts s g x + d such that σ f fx + H d + s + v = 0, = 1,..., n, 4.11 whch together wth the defnton of H and v yelds where s = s 1,..., s n gx + d. By the convexty of f, one has fx + Hd + s + v = 0, fx fx + fx, x x. 3 Observe from 1.12 that ω s strctly ncreasng n [0,. Thus, ts nverse functon ω 1 s well-defned when restrcted to ths nterval and moreover t s strctly ncreasng. 16

17 In addton, by s gx + d and the convexty of g, one has gx + d gx + s, x + d x. Usng the last three relatons, 3.27 and 4.10, we can obtan that fx + fx + v, d + gx + d fx + fx, x x + fx + v, d + gx + s, x + d x = F + fx + v + s, x + d x v, x x = F + Hd, x + d x v, x x = F Hd, d Hd, x x v, x x F d 2 H + d H x x H + v H x x H F + β d 2 H, 4.12 where By 3.17 and 4.8, we have β = L f σ f 3/2 21 η 2 c 1 n p v, d 1 + λ p 1 + λ v H d H η p 1 + λ d 2 H η p max d 2 H In addton, recall that ω t = t ln1 t. It thus follows that t k ω t = k t2 2 k=2 t k = k=0 t 2, t [0, t Ths nequalty mples that λ p ω 1 + λ p λ /1 + λ 2 21 λ /1 + λ = 1 2 p λ 2 p max 1 + λ 2 λ 2 = p max 2 d 2 H, 4.15 where p max s defned n 4.8. Recall that s g x + d. By the convexty of g, one has g x + d g x s, d. It thus follows from ths and 4.11 that for = 1,..., n, fx + v, d + g x + d g x fx + v, d + s, d = fx + s + v, d = d, H d By a smlar argument as for 3.3 and the defnton of x +, one has fx + fx λ ι ι fx, d ι + ω λι 1 + λ ι It also follows from the convexty of g ι that g ι x ι + d ι g ι x ι 1 [g ι x ι + d ι g ι x ι ]. 1 + λ ι 1 + λ ι. 17

18 Usng the last two nequaltes and the defnton of x +, we have F x + = fx + + g ι x ι + dι 1+λ ι + g j x j j ι = fx + + gx + g ι x ι + dι 1+λ ι g ι x ι fx + 1 λ 1+λ ι ι fx, d ι + ω ι 1+λ ι + gx + g ι x ι + dι 1+λ ι g ι x ι = F x λ ι ι fx, d ι + ω F x λ ι ι fx, d ι + ω λ ι 1+λ ι λ ι 1+λ ι + g ι x ι + dι 1+λ ι g ι x ι λ ι [g ι x ι + d ι g ι x ι ] 1+λ ι + ω = F x λ ι [ ι fx + v ι, d ι + g ι x ι + d ι g ι x ι ] vι,dι λ ι 1+λ ι Takng expectaton wth respect to ι on both sdes and usng 4.8, 4.12, 4.14, 4.15 and 4.16, one has E ι[f x + ] F x + p [ fx + v, d + g x + d g x ] p v, d + λ p ω 1 + λ }{{} 1 + λ 1 + λ 0 due to 4.16 F x + θ [ fx + v, d + g x + d g x ] p v, d + λ p ω 1 + λ 1 + λ = F x + θ [ fx + v, d + gx + d gx] p v, d + λ p ω 1 + λ 1 + λ = 1 θf x + θ [fx + fx + v, d + gx + d] p v, d + λ p ω 1 + λ 1 + λ 1 θf x + θf + β d 2 H + η p max d 2 H + pmax 2 d 2 H = 1 θf x + θf + θβ + 1/2 + ηp max d 2 H. 1 θf x + θf + c 2 λ 2, 4.17 where the last nequalty s due to 4.13, 4.7 and d 2 H = λ2 λ 2. One can easly observe from 4.17 that the concluson of ths theorem holds f c 2 = 0. We now assume c 2 > 0. Let δ + = F x + F and δ = F x F. It then follows from 4.17 that E ι [δ + ] 1 θδ + c 2 λ 2, whch yelds 2 λ 1 Eι [δ + ] 1 θδ 4.18 c 2 By the assumpton, one has F x F x 0 F + ωc 1 /p mn. By ths and 3.31, we have ωc 1 λ F x F ωc 1 /p mn, whch together wth the monotoncty of ω n [0, mples p mn λ 1. Observe that 1 k t k ωt = t ln1 + t = k k=2 t2 2 t3 3 t2, t [0, 1]. 6 18

19 Ths and p mn λ 1 lead to It then follows from ths and 4.1 that whch together wth 4.18 gves Hence, we obtan that whch proves 4.9 as desred. 4.3 Global lnear convergence 2 ω p mn λ 1 6 p2 mn λ. 2 E ι [δ + ] δ 1 12 p2 mn λ, E ι [δ + ] δ p2 mn 12c 2 Eι [δ + ] 1 θδ. E ι [δ + 12c2 + p 2 mn 1 θ ] 12c 2 + p 2 δ, mn In ths subsecton we establsh a global lnear rate of convergence for RBPDN under the followng assumpton n addton to Assumpton 1. Assumpton 2 There exsts some c 3 > 0 such that dx c 3 λx, x Sx 0, where Sx 0, λx and dx are defned n 1.7, 3.9 and 3.11, respectvely. The followng proposton shows that Assumpton 2 holds for a class of g ncludng the case where g s smooth but not necessarly self-concordant and g s Lpschtz contnuous n Sx 0. 4 Proposton 4.1 Suppose that g s Lpschtz dfferentable n Sx 0 wth a Lpschtz constant L g 0. Then Assumpton 2 holds wth c 3 = σ f /L f + L g, where σ f and L f are defned n 2.4 and 3.30, respectvely. Proof. Let x Sx 0 be arbtrarly chosen. It follows from 3.11 and the dfferentablty of g that fx + 2 fx dx + gx + dx = 0, whch, together wth 3.9, 3.30 and the Lpschtz contnuty of g, mples that λx = fx + gx x 1 σf fx + gx, = 1 and hence the concluson holds. σf gx gx + dx 2 fx dx L f +L g dx. We next provde a lower bound for λx n terms of the optmalty gap, whch wll play crucal role n our subsequent analyss. 4 Ths covers the case where g = 0, whch, for nstance, arses n the nteror pont methods for solvng smooth convex optmzaton problems. σf 19

20 Lemma 4.2 Let x domf and λx be defned n 3.9. Then λx ω 1 F x F, 4.19 where ω 1 s the nverse functon of ω when restrcted to the nterval [0, 1. Proof. Observe from 1.12 that ω t [0, for t [0, 1 and ω s strctly ncreasng n [0, 1. Thus ts nverse functon ω 1 s well-defned when restrcted to ths nterval. It also follows that ω 1 t [0, 1 for t [0, and ω 1 s strctly ncreasng n [0,. We dvde the rest of the proof nto two separable cases as follows. Case 1: λx < 1. It follows from Lemma 3.2 that F x F ω λx. Takng ω 1 on both sdes of ths relaton and usng the monotoncty of ω 1, we see that 4.19 holds. Case 2: λx clearly holds n ths case due to ω 1 t [0, 1 for all t 0 In what follows, we show that under Assumpton 2 RBPDN enjoys a global lnear convergence. Theorem 4.4 Let {x k } be generated by RBPDN. Suppose that Assumpton 2 holds. Then E[F x k F ] [1 c2 4p 2 ] k mn 1 ω 1 δ c 4 p mn ω 1 F x 0 F, k 0, δ 0 where δ 0 = F x 0 F, and σ f and c 1 are defned n 2.4 and 3.32, respectvely. c 4 = c 1c 3 nσf, η Proof. Let k 0 be arbtrarly chosen. For convenence, let x = x k and x + = x k+1. By the updatng scheme of x k+1, one can observe that x + j = x j for j ι and x + ι = x ι + d ιx 1 + λ ι x, where ι {1,..., n} s randomly chosen wth probablty p ι and d ι x s an approxmate soluton to problem 3.15 that satsfes 3.16 and 3.17 for some v ι and η [0, 1/4]. To prove ths theorem, t suffces to show that E ι [F x + F ] [1 c2 4p 2 mn 1 ω c 4 p mn ω 1 δ 0 δ 0 Indeed, t follows from 3.34, 4.20 and Assumpton 2 that n Ths together wth 4.19 yelds λ x c 1 nσf 1 η dx c 4 λx. n λ x c 4 ω 1 F x F. Usng ths, 4.1 and the monotoncty of ω n [0,, we obtan that E ι [F x + ] F x 1 2 ω c 4 p mn ω 1 F x F. ] F x F

21 Let δ + = F x + F and δ = F x F. It then follows that E ι [δ + ] δ 1 2 ω c 4 p mn ω 1 δ Consder the functon t = ω 1 s. Then s = ω t. Dfferentatng both sdes wth respect to s, we have ω t dt ds = 1, whch along wth ω t = t ln1 t yelds ω 1 s = dt ds = 1 ω t = 1 t t = 1 ω 1 ω 1 In vew of ths and ωt = t ln1 + t, one has that for any α > 0, d ds [ωαω 1 s] = αω αω 1 sω 1 s = α αω 1 s 1 + αω 1 s. s Notce that δ δ 0 due to x Sx 0. By ths and the monotoncty of ω 1 whch mples that ω 1 s ω 1 δ ω 1 δ 0, s [0, δ], 1 ω 1 s 1 + αω 1 s 1 ω 1 δ αω 1, s [0, δ]. δ 0 Also, observe that ωαω 1 0 = 0. Usng these relatons and 4.23, we have δ ωαω 1 δ = 0 d δ ds [ωαω 1 s]ds = 0 α 2 1 ω 1 s 1 + αω 1 Ths and 4.22 wth α = c 4 p mn lead to E ι [δ + ] [1 c2 4p 2 ] mn 1 ω 1 δ c 4 p mn ω 1 δ, δ 0 whch gves 4.21 as desred. s 1 ω 1 s ω 1 = α2 1 ω 1 s s 1 + αω 1 s. 4.23, one can see that s ds α2 1 ω 1 δ αω 1 δ 0 δ. The followng result s an mmedate consequence of Proposton 4.1 and Theorem 4.4. Corollary 4.1 Let {x k } be generated by RBPDN. Suppose that g s Lpschtz dfferentable n Sx 0 wth a Lpschtz constant L g 0. Then E[F x k F ] where δ 0 = F x 0 F, [1 c2 4p 2 ] k mn 1 ω 1 δ c 4 p mn ω 1 F x 0 F, k 0, δ 0 c 4 = nc1 σ f 1 ηl f + L g, and σ f, L f and c 1 are defned n 2.4, 3.30 and 3.32, respectvely. 21

22 4.4 Convergence results for proxmal damped Newton methods In ths subsecton we specalze the convergence results n Subsecton 4.3 to some PDN methods [24, 42, 33] and mprove ther exstng teraton complexty. One can observe that RBPDN reduces to PDN [33] or DN [24] 5 by settng n = 1. It thus follows from Corollary 4.1 that PDN for a class of g and DN are globally lnearly convergent, whch s stated below. Theorem 4.5 Suppose that g s Lpschtz dfferentable n Sx 0. Then PDN [33] for such g and DN [24] when appled to problem 1.1 are globally lnearly convergent. In what follows, we show that Theorem 4.5 can be used to sharpen the exstng teraton complexty of some PDN methods presented n [24, 42, 33]. A mxture of DN and Newton methods s presented n [24, Secton 4.1.5] for solvng problem 1.1 wth g = 0. In partcular, ths method conssts of two stages. Gven an ntal pont x 0, β 0, 3 5/2 and ɛ > 0, the frst stage performs the DN teratons x k+1 = x k dx k 1 + λx k 4.24 untl fndng some x K1 such that λx K1 β, where d and λ are defned n 3.11 and 3.12, respectvely. The second stage executes the standard Newton teratons x k+1 = x k dx k, 4.25 startng at x K1 and termnatng at some x K2 such that λx K2 ɛ. As shown n [24, Secton 4.1.5], the second stage converges quadratcally: λx k+1 λx k 1 λx k 2, k K In addton, an upper bound on K 1 s establshed n [24, Secton 4.1.5], whch s K 1 F x 0 F /ωβ In vew of 4.26, one can easly show that log ɛ 2 log1 β K 2 K 1 log log β 2 log1 β Observe that the frst stage of ths method s just DN, whch s a specal case of RBPDN wth n = 1 and η = 0. It thus follows from Theorem 4.5 that the frst stage converges lnearly. In fact, t can be shown that 1 1 ω 1 F x k+1 F δ ω 1 δ 0 F x k F, k K 1, 4.29 where δ 0 = F x 0 F. Indeed, snce g = 0, one can observe from 3.9 and 3.12 that λx k = λx k. It then follows from ths, g = 0 and [24, Theorem ] that F x k+1 F x k ω λx k for all k K 1. Ths together wth 4.19 mples that 5 PDN becomes DN f g = 0. F x k+1 F x k ωω 1 F x k F, k K 1. 22

23 The relaton 4.29 then follows from ths and a smlar argument as n the proof of Theorem 4.4. Let K = logωβ log δ 0 log 1 1 ω 1 δ 0 1+ω 1 δ 0 where t + = maxt, 0. In vew of 4.29, one can easly verfy that F x K F ωβ, whch along wth 3.14 mples that λx K β. By 4.27 and the defnton of K 1, one can have K 1 mn { K, δ0 /ωβ }, whch sharpens the bound Combnng ths relaton and 4.28, we thus obtan the followng new teraton complexty for fndng an approxmate soluton of 1.1 wth g = 0 by a mxture of DN and Newton method [24, Secton 4.1.5]. Theorem 4.6 Let x 0 domf, β 0, 3 5/2 and ɛ > 0 be gven. Then the mxture of DN and Newton methods [24, Secton 4.1.5] for solvng problem 1.1 wth g = 0 requres at most mn logωβ log δ 0 δ0 log ɛ 2 log1 β, log 1 1 ω 1 ωβ + log 2 log β 2 log1 β δ 0 1+ω 1 δ 0 teratons for fndng some x k satsfyng λx k ɛ, where δ 0 = F x 0 F. + Recently, Zhang and Xao [42] proposed an nexact DN method for solvng problem 1.1 wth g = 0, whose teratons are updated as follows: x k+1 = x k ˆdx k 1 + ˆλx, k 0, k where ˆdx k s an approxmaton to dx k and ˆλx k = ˆdx k, 2 fx k ˆdx k see [42, Algorthm 1] for detals. It s shown n [42, Theorem 1] that such {x k } satsfes F x k+1 F x k 1 2 ω λx k, k 0, , ω λx k ω λx k, f λx k 1/6, 4.31 where λ s defned n These relatons are used n [42] for dervng an teraton complexty of the nexact DN method. In partcular, ts complexty analyss s dvded nto two parts. The frst part estmates the number of teratons requred for generatng some x K1 satsfyng λx K1 1/6, whle the second part estmates the addtonal teratons needed for generatng some x K2 satsfyng F x K2 F ɛ. In [42], the relaton 4.30 s used to show that K 1 2F x 0 F /ω1/6, 4.32 whle 4.31 s used to establsh K 2 K 1 2ω1/6 log ɛ It follows from these two relatons that the nexact DN method can fnd an approxmate soluton x k satsfyng F x k F ɛ n at most 2F x 0 F 2ω1/6 + log ω1/6 2 ɛ 23

24 teratons, whch s stated n [42, Corollary 1]. By a smlar analyss as above, one can show that the nexact DN method [42, Algorthm 1] s globally lnearly convergent. In fact, t can be shown that F x k+1 F 1 1 ω 1 δ ω 1 F x k F, k 0, 4.34 δ 0 where δ 0 = F x 0 F. Indeed, snce g = 0, one has λx k = λx k. It follows from ths, 4.19 and 4.30 that F x k+1 F x k 1 2 ωω 1 F x k F, k 0. The relaton 4.34 then follows from ths and a smlar dervaton as n the proof of Theorem 4.4. By 4.32, 4.34 and a smlar argument as above, one can have K 1 mn log 1 2 ω1/6 log δ 0 2δ0, log 1 1 ω 1 ω1/6, δ 0 21+ω 1 δ 0 whch mproves the bound Combnng ths relaton and 4.33, we thus obtan the followng new teraton complexty for fndng an approxmate soluton of 1.1 wth g = 0 by the aforementoned nexact DN method. Theorem 4.7 Let x 0 domf and ɛ > 0 be gven. Then the nexact DN method [42, Algorthm 1] for solvng problem 1.1 wth g = 0 requres at most mn log 1 2 ω1/6 log δ 0 2δ0 2ω1/6, log 1 1 ω 1 ω1/6 + log 2 ɛ δ 0 21+ω 1 δ 0 teratons for fndng some x k satsfyng F x k F ɛ, where δ 0 = F x 0 F. + Dnh-Tran et al. recently proposed n [33, Algorthm 1] a proxmal Newton method for solvng problem 1.1 wth general g. Akn to the aforementoned method [24, Secton 4.1.5] for 1.1 wth g = 0, ths method also conssts of two stages or phases. The frst stage performs the PDN teratons n the form of 4.24 for fndng some x K1 such that λx K1 ω0.2, whle the second stage executes the proxmal Newton teratons n the form of 4.25 startng at x K1 and termnatng at some x K2 such that λx K2 ɛ. As shown n [33, Theorem 6], the second stage converges quadratcally. The followng relatons are essentally establshed n [33, Theorem 7]: + K 1 F x 0 F /ω0.2, 4.35 K 2 K log log ɛ Throughout the remander of ths subsecton, suppose that Assumpton 2 holds. Observe that the frst stage of ths method s just PDN, whch s a specal case of RBPDN wth n = 1 and η = 0. It thus follows from Theorem 4.5 that the frst stage converges lnearly. In fact, t can be shown that F x k+1 F [1 ĉ2 1 ω 1 ] k δ ĉω 1 F x 0 F, k K 1, 4.37 δ 0 where δ 0 = F x 0 F, ĉ = c 3 σf, and σ f and c 3 are gven n 2.4 and Assumpton 2, respectvely. Indeed, by 3.12 and 3.35, one has dx k λx k / σ f. In addton, by Assumpton 2, we 24

25 have dx k c 3 λx k. It follows from these two relatons that λx k ĉ λx k, whch together wth 4.19 yelds λx k ĉω 1 F x k F. Ths and 3.13 mply that F x k+1 F x k ωĉω 1 F x k F, k K 1. The relaton 4.37 then follows from ths and a smlar argument as n the proof of Theorem 4.4. Let K = logω0.2 log δ 0 log 1 ĉ2 1 ω 1 δ 0 1+ĉω 1 δ 0 By 4.37, one can easly verfy that F x K F ω0.2, whch along wth 3.14 mples that λx K 0.2. By 4.27 and the defnton of K 1, one can have K 1 mn { K, δ0 /ω0.2 }, whch sharpens the bound Combnng ths relaton and 4.36, we thus obtan the followng new teraton complexty for fndng an approxmate soluton of 1.1 by the aforementoned proxmal Newton method. Theorem 4.8 Let x 0 domf and ɛ > 0 be gven. Suppose that Assumpton 2 holds. Then the proxmal Newton method [33, Algorthm 1] for solvng problem 1.1 requres at most mn logω0.2 log δ 0 δ0, log 1 ĉ2 1 ω 1 ω log log 0.28 ɛ δ 0 1+ĉω 1 δ 0 + teratons for fndng some x k satsfyng λx k ɛ, where δ 0 = F x 0 F, ĉ = c 3 σf, and σ f and c 3 are gven n 2.4 and Assumpton 2, respectvely. Remark: Suppose that g s Lpschtz dfferentable n Sx 0 wth a Lpschtz constant L g 0. It follows from Proposton 4.1 that Assumpton 2 holds wth c 3 = σ f /L f + L g, where L f s defned n 3.30, and thus Theorem 4.8 holds wth ĉ = σ f /L f + L g Numercal results In ths secton we conduct numercal experment to test the performance of RBPDN. In partcular, we apply RBPDN to solve a regularzed logstc regresson RLR model and a sparse regularzed logstc regresson SRLR model. We also compare RBPDN wth a randomzed block accelerated proxmal gradent RBAPG method proposed n [17] on these problems. All codes are wrtten n MATLAB and all computatons are performed on a MacBook Pro runnng wth Mac OS X Lon and 4GB memory. For the RLR problem, our goal s to mnmze a regularzed emprcal logstc loss functon, partcularly, to solve the problem: { L µ := mn x R N L µ x := 1 m } m log1 + exp y w, x + µ 2 x 2 for some µ > 0, where w R N s a sample of N features and y { 1, 1} s a bnary classfcaton of ths sample. Ths model has recently been consdered n [42]. Smlarly, for the SRLR problem, we am to solve the problem: { L γ,µ := mn x R N L γ,µ x := 1 m } m log1 + exp y w, x + µ 2 x 2 + γ x

26 for some µ, γ > 0. In our experments below, we fx m = 1000 and set N = 3000, 6000,..., For each par m, N, we randomly generate 10 copes of data {w, y } m ndependently. In each copy, the elements of w are generated accordng to the standard unform dstrbuton on the open nterval 0, 1 and y s generated accordng to the dstrbuton Pξ = 1 = Pξ = 1 = 1/2. As n [42], we normalze the data so that w = 1 for all = 1,..., m, and set the regularzaton parameters µ = 10 5 and γ = We now apply RBPDN and RBAPG to solve problem 5.1. For both methods, the decson varable x R N s dvded nto 10 blocks sequentally and equally. At each teraton k, they pck a block ι unformly at random. For RBPDN, t needs to fnd a search drecton d ι x k satsfyng 2.2 and 2.3 wth f = L µ and g = 0, that s, 2 ιιl µ x k d ι x k + ι L µ x k + v ι = 0, 5.3 v ι, 2 ιιl µ x k 1 v ι η d ι x k, 2 ιιl µ x k d ι x k 5.4 for some η [0, 1/4]. To obtan such a d ι x k, we apply conjugate gradent method to solve the equaton 2 ιιl µ x k d ι = ι L µ x k untl an approxmate soluton d ι satsfyng 2 ιιl µ x k d ι + ι L µ x k 1 4 µ d ι, 2 ιιl µ x k d ι. 5.5 s found and then set d ι x k = d ι. Notce from 5.1 that 2 ιιl µ x k µi. In vew of ths, one can verfy that such d ι x k satsfes 5.3 and 5.4 wth η = 1/4. In addton, we choose x 0 = 0 for both methods and termnate them once the dualty gap s below More specfcally, one can easly derve a dual of problem 5.1 gven by max s R m D µs := 1 m m log1 ms 1 2µ m 2 m s y w ms s log 1 ms. Let {x k } be a sequence of approxmate solutons to problem 5.1 generated by RBPDN or RBAPG and s k R m the assocated dual sequence defned as follows: s k = exp y w, x k m1 + exp y w, x k, = 1,..., m. 5.6 We use L µ x k D µ s k 10 3 as the termnaton crteron for RBPDN or RBAPG, whch s checked once every 10 teratons. The computatonal results averaged over the 10 copes of data generated above are presented n Table 1. In detal, the problem sze N s lsted n the frst column. The average number of teratons upon round off for RBPDN and RBAPG are gven n the next two columns. The average CPU tme n seconds for these methods are presented n columns four and fve, and the average objectve functon value of 5.1 obtaned by them are gven n the last two columns. One can observe that both methods are comparable n terms of objectve values, but RBPDN substantally outperforms RBAPG n terms of CPU tme. In the next experment, we apply RBPDN and RBAPG to solve problem 5.2. Smlarly as above, the decson varable x R N s dvded nto 10 blocks sequentally and equally. At each teraton k, they pck a block ι unformly at random. For RBPDN, t needs to compute a search 26

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

Inexact Newton Methods for Inverse Eigenvalue Problems

Inexact Newton Methods for Inverse Eigenvalue Problems Inexact Newton Methods for Inverse Egenvalue Problems Zheng-jan Ba Abstract In ths paper, we survey some of the latest development n usng nexact Newton-lke methods for solvng nverse egenvalue problems.

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

Yong Joon Ryang. 1. Introduction Consider the multicommodity transportation problem with convex quadratic cost function. 1 2 (x x0 ) T Q(x x 0 )

Yong Joon Ryang. 1. Introduction Consider the multicommodity transportation problem with convex quadratic cost function. 1 2 (x x0 ) T Q(x x 0 ) Kangweon-Kyungk Math. Jour. 4 1996), No. 1, pp. 7 16 AN ITERATIVE ROW-ACTION METHOD FOR MULTICOMMODITY TRANSPORTATION PROBLEMS Yong Joon Ryang Abstract. The optmzaton problems wth quadratc constrants often

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION Advanced Mathematcal Models & Applcatons Vol.3, No.3, 2018, pp.215-222 ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EUATION

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information
