Inexact Proximal Gradient Methods for Non-Convex and Non-Smooth Optimization


The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Bin Gu, De Wang, Zhouyuan Huo, Heng Huang*
Department of Electrical & Computer Engineering, University of Pittsburgh, USA
Department of Computer Science and Engineering, University of Texas at Arlington, USA
big0@pitt.edu, wangdelp@gmail.com, zhouyuan.huo@pitt.edu, heng.huang@pitt.edu

* To whom all correspondence should be addressed. Copyright 2018, Association for the Advancement of Artificial Intelligence. All rights reserved.

Abstract

In machine learning research, proximal gradient methods are popular for solving various optimization problems with non-smooth regularization. Inexact proximal gradient methods are extremely important when exactly solving the proximal operator is time-consuming, or when the proximal operator does not have an analytic solution. However, existing inexact proximal gradient methods only consider convex problems, and knowledge of inexact proximal gradient methods in the non-convex setting is very limited. To address this challenge, we first propose three inexact proximal gradient algorithms, including the basic and Nesterov's accelerated versions. We then provide a theoretical analysis of the basic and accelerated versions. The theoretical results show that our inexact proximal gradient algorithms can achieve the same convergence rates as the exact proximal gradient algorithms in the non-convex setting. Finally, we show the applications of our inexact proximal gradient algorithms on three representative non-convex learning problems. Empirical results confirm the superiority of our new inexact proximal gradient algorithms.

Introduction

Many machine learning problems involve non-smooth regularization, such as machine learning models with a variety of sparsity-inducing penalties (Bach et al. 2012). Thus, efficiently solving optimization problems with non-smooth regularization is important for many machine learning applications. In this paper, we focus on the optimization problem of a machine learning model with non-smooth regularization:

$$\min_{x \in \mathbb{R}^N} f(x) = g(x) + h(x) \qquad (1)$$

where $g: \mathbb{R}^N \rightarrow \mathbb{R}$, corresponding to the empirical risk, is smooth and possibly non-convex, and $h: \mathbb{R}^N \rightarrow \mathbb{R}$, corresponding to the regularization term, is non-smooth and possibly non-convex.

Proximal gradient methods are popular for solving various optimization problems with non-smooth regularization. The pivotal step of a proximal gradient method is to solve the proximal operator

$$\mathrm{Prox}_{\gamma h}(y) = \arg\min_{x \in \mathbb{R}^N} \frac{1}{2\gamma}\|x - y\|^2 + h(x) \qquad (2)$$

where $\gamma$ is the stepsize. If the function $h(x)$ is simple enough, the solution of the proximal operator can be obtained analytically. For example, if $h(x) = \|x\|_1$, the solution of the proximal operator is given by a shrinkage-thresholding operator (Beck and Teboulle 2009). If the function $h(x)$ is complex, so that the corresponding proximal operator does not have an analytic solution, a specific algorithm has to be designed for solving the proximal operator. For example, for $h(x) = \|x\|_1 + c\sum_{i<j}\max\{|x_i|, |x_j|\}$ as used in OSCAR (Zhong and Kwok 2012) for sparse regression with automatic feature grouping, Zhong and Kwok (2012) proposed an iterative group merging algorithm that solves the proximal operator exactly. However, solving the proximal operator becomes expensive when the function $h(x)$ is complex. Taking OSCAR as an example again, when the data are high-dimensional (empirically, more than 1,000 features), the iterative group merging algorithm becomes very inefficient.
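To make the contrast concrete, the snippet below is a minimal sketch (ours, not from the paper), written with NumPy: the $\ell_1$ case has a one-line soft-thresholding solution, while a complex regularizer only exposes the subproblem objective of (2) that an inner solver must minimize, exactly or up to some tolerance.

```python
import numpy as np

def prox_l1(y, gamma, lam=1.0):
    """Closed-form proximal operator of h(x) = lam * ||x||_1:
    argmin_x 1/(2*gamma) * ||x - y||^2 + lam * ||x||_1  (soft thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)

def prox_objective(x, y, gamma, h):
    """Objective of the proximal subproblem (2) for a general regularizer h;
    when h has no analytic prox, this is what an inner solver must minimize,
    either exactly or only up to some error."""
    return 0.5 / gamma * np.sum((x - y) ** 2) + h(x)

y = np.array([0.3, -1.2, 0.05])
x_hat = prox_l1(y, gamma=0.1)                 # exact prox for the simple l1 case
print(x_hat, prox_objective(x_hat, y, 0.1, h=lambda x: np.abs(x).sum()))
```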
Even worse, there may be no solver at all for exactly computing the proximal operator when the function $h(x)$ is overly complex. For example, Grave, Obozinski, and Bach (2011) proposed the trace Lasso norm, which takes the correlation of the design matrix into account to stabilize the estimation in regression. However, due to the complexity of the trace Lasso norm, to the best of our knowledge there is still no solver for the corresponding proximal operator.

To address the above issues, Schmidt, Le Roux, and Bach (2011) first proposed inexact proximal gradient methods, including a basic version (IPG) and Nesterov's accelerated version (IAPG), which solve the proximal operator approximately (i.e., tolerating an error in the calculation of the proximal operator). They proved that the inexact proximal gradient methods can achieve the same convergence rates as the exact proximal gradient methods, provided that the errors in computing the proximal operator decrease at appropriate rates. Independently, Villa et al. (2013) proposed an IAPG algorithm and proved the corresponding convergence rate. (Throughout, the stepsize $\gamma$ is either set manually or determined automatically by a backtracking line-search procedure (Beck and Teboulle 2009).)

Table 1: Representative exact (and inexact) proximal gradient algorithms. (C and NC are the abbreviations of convex and non-convex, respectively.)

Algorithm      Proximal   Accelerated   g(x)    h(x)    Reference
PG + APG       Exact      Yes           C       C       Beck and Teboulle (2009)
APG            Exact      Yes           C+NC    C       Ghadimi and Lan (2016)
PG             Exact      No            NC      NC      Boţ, Csetnek, and László (2016)
APG            Exact      Yes           C+NC    C+NC    Li and Lin (2015)
IPG + IAPG     Inexact    Yes           C       C       Schmidt, Le Roux, and Bach (2011)
AIFB (IAPG)    Inexact    Yes           C       C       Villa et al. (2013)
IPG + AIPG     Inexact    Yes           C+NC    C+NC    Ours

In the paper of Villa et al. (2013), they called IAPG the inexact forward-backward splitting method (AIFB), which is well known in the field of signal processing. We summarize these works in Table 1.

From Table 1, we find that the existing inexact proximal gradient methods only consider convex problems. However, many optimization problems in machine learning are non-convex. The non-convexity originates either from the empirical risk function $g(x)$ or from the regularization function $h(x)$. First, consider the empirical risk function $g(x)$ (i.e., the loss function). The correntropy-induced loss (Feng et al. 2015), widely used for robust regression and classification, is non-convex; the symmetric sigmoid loss on the unlabeled samples used in semi-supervised SVMs (Chapelle, Chi, and Zien 2006) is also non-convex. Second, consider the regularization function $h(x)$. The capped-$\ell_1$ penalty (Zhang 2010) used for unbiased variable selection and the low-rank constraint (Jain, Meka, and Dhillon 2010) widely used for matrix completion are both non-convex. However, our knowledge of inexact proximal gradient methods is very limited in the non-convex setting.

To address this challenge, in this paper we first propose three inexact proximal gradient algorithms, including the basic and Nesterov's accelerated versions, which can handle non-convex problems. Then we give the theoretical analysis of the basic and Nesterov's accelerated versions. The theoretical results show that our inexact proximal gradient algorithms can have the same convergence rates as the exact proximal gradient algorithms. Finally, we provide the applications of our inexact proximal gradient algorithms on three representative non-convex learning problems. The applications on robust OSCAR and link prediction show that our inexact proximal gradient algorithms can be significantly faster than the exact proximal gradient algorithms. The application on robust trace Lasso fills the vacancy that there is no proximal gradient algorithm for trace Lasso.

Contributions. The main contributions of this paper are summarized as follows:
1. We propose the basic and accelerated inexact proximal gradient algorithms with rigorous convergence guarantees. Specifically, our inexact proximal gradient algorithms can have the same convergence rates as the exact proximal gradient algorithms in the non-convex setting.
2. We provide the applications of our inexact proximal gradient algorithms on three representative non-convex learning problems, i.e., robust OSCAR, link prediction and robust trace Lasso. The results confirm the superiority of our inexact proximal gradient algorithms.

Related Works

Proximal gradient methods are among the most important methods for solving various optimization problems with non-smooth regularization, and a variety of exact proximal gradient methods exist. Specifically, for convex problems, Beck and Teboulle (2009) proposed the basic proximal gradient (PG) method and Nesterov's accelerated proximal gradient (APG) method.
They proved that PG has the convergence rate $O(\frac{1}{T})$ and APG has the convergence rate $O(\frac{1}{T^2})$, where $T$ is the number of iterations. For non-convex problems, Ghadimi and Lan (2016) considered that only $g(x)$ could be non-convex, and proved that the convergence rate of the APG method is $O(\frac{1}{T})$. Boţ, Csetnek, and László (2016) considered that both $g(x)$ and $h(x)$ could be non-convex, and proved the convergence of the PG method. Li and Lin (2015) considered that both $g(x)$ and $h(x)$ could be non-convex, and proved that the APG algorithm converges in a finite number of iterations, at a linear rate, or at a sublinear rate, under different conditions. We summarize these exact proximal gradient methods in Table 1. In addition to the above batch exact proximal gradient methods, there are also stochastic and online proximal gradient methods (Duchi and Singer 2009; Xiao and Zhang 2014). Because they are beyond the scope of this paper, we do not review them here.

Preliminaries

In this section, we introduce Lipschitz smoothness, the $\varepsilon$-approximate subdifferential and the $\varepsilon$-approximate Kurdyka-Łojasiewicz (KL) property, which are critical to the convergence analysis of our inexact proximal gradient methods in the non-convex setting.

Lipschitz smoothness: For the smooth function $g(x)$, we have the Lipschitz constant $L$ of $\nabla g(x)$ (Wood and Zhang 1996) as follows.

Assumption 1. $L$ is the Lipschitz constant of $\nabla g(x)$. Thus, for all $x$ and $y$, $L$-Lipschitz smoothness can be presented as
$$\|\nabla g(x) - \nabla g(y)\| \le L\|x - y\| \qquad (3)$$

Equivalently, $L$-Lipschitz smoothness can also be written as formulation (4):
$$g(x) \le g(y) + \langle \nabla g(y), x - y\rangle + \frac{L}{2}\|x - y\|^2 \qquad (4)$$

$\varepsilon$-approximate subdifferential: Because inexact proximal gradient steps are used in this paper, an $\varepsilon$-approximate proximal operator may produce an $\varepsilon$-approximate subdifferential. In the following, we give the definition of the $\varepsilon$-approximate subdifferential (Bertsekas et al. 2003), which will be used in the $\varepsilon$-approximate KL property (i.e., Definition 2).

Definition 1 ($\varepsilon$-approximate subdifferential). Given a convex function $h(x): \mathbb{R}^N \rightarrow \mathbb{R}$ and a positive scalar $\varepsilon$, the $\varepsilon$-approximate subdifferential of $h(x)$ at a point $x \in \mathbb{R}^N$, denoted $\partial_\varepsilon h(x)$, is
$$\partial_\varepsilon h(x) = \left\{ d \in \mathbb{R}^N : h(y) \ge h(x) + \langle d, y - x\rangle - \varepsilon,\ \forall y \right\} \qquad (5)$$

Based on Definition 1, if $d \in \partial_\varepsilon h(x)$, we say that $d$ is an $\varepsilon$-approximate subgradient of $h(x)$ at the point $x$.

$\varepsilon$-approximate KL property: Originally, the KL property was introduced to analyze the convergence rate of exact proximal gradient methods in the non-convex setting (Li and Lin 2015; Boţ, Csetnek, and László 2016). Because this paper focuses on inexact proximal gradient methods, we correspondingly propose the $\varepsilon$-approximate KL property in Definition 2, where the function $\mathrm{dist}(x, S)$ is defined by $\mathrm{dist}(x, S) = \min_{y \in S}\|x - y\|$, and $S$ is a subset of $\mathbb{R}^N$.

Definition 2 ($\varepsilon$-KL property). A function $f(x) = g(x) + h(x): \mathbb{R}^N \rightarrow (-\infty, +\infty]$ is said to have the $\varepsilon$-KL property at $\bar{u} \in \{u \in \mathbb{R}^N : \nabla g(u) + \partial_\varepsilon h(u) \neq \emptyset\}$ if there exist $\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar{u}$ and a function $\varphi \in \Phi_\eta$, such that for all $u \in U \cap \{u \in \mathbb{R}^N : f(\bar{u}) < f(u) < f(\bar{u}) + \eta\}$, the following inequality holds:
$$\varphi'\big(f(u) - f(\bar{u})\big)\,\mathrm{dist}\big(0, \nabla g(u) + \partial_\varepsilon h(u)\big) \ge 1 \qquad (6)$$
where $\Phi_\eta$ stands for the class of functions $\varphi: [0, \eta) \rightarrow \mathbb{R}_+$ satisfying: (1) $\varphi$ is concave and continuously differentiable on $(0, \eta)$; (2) $\varphi$ is continuous at 0 with $\varphi(0) = 0$; (3) $\varphi'(x) > 0$ for all $x \in (0, \eta)$.

Inexact Proximal Gradient Algorithms

In this section, we first propose the basic inexact proximal gradient algorithm for non-convex optimization, and then propose two accelerated inexact proximal gradient algorithms.

Basic Version

As shown in Table 1, for convex problems (i.e., both the functions $g(x)$ and $h(x)$ are convex), Schmidt, Le Roux, and Bach (2011) proposed a basic inexact proximal gradient (IPG) method. We follow the framework of IPG in Schmidt, Le Roux, and Bach (2011), and give our IPG algorithm for non-convex optimization problems (i.e., either the function $g(x)$ or the function $h(x)$ is non-convex). Specifically, our IPG algorithm is presented in Algorithm 1. Similar to the exact proximal gradient algorithm, the pivotal step of our IPG (i.e., Algorithm 1) is to compute an inexact proximal operator $x \in \mathrm{Prox}^{\varepsilon}_{\gamma h}(y)$ defined as
$$\mathrm{Prox}^{\varepsilon}_{\gamma h}(y) = \left\{ z \in \mathbb{R}^N : \frac{1}{2\gamma}\|z - y\|^2 + h(z) \le \varepsilon + \min_{x \in \mathbb{R}^N}\frac{1}{2\gamma}\|x - y\|^2 + h(x) \right\}$$
where $\varepsilon$ denotes an error in the calculation of the proximal operator. As discussed in Tappenden, Richtárik, and Gondzio (2016), there are several methods to compute the inexact proximal operator. The most popular one is to use a primal-dual algorithm to control the duality gap (Bach et al. 2012); based on the duality gap, we can strictly control the error in the calculation of the proximal operator.

Algorithm 1: Basic inexact proximal gradient method (IPG)
Input: number of iterations $T$, errors $\varepsilon_k$ ($k = 1, 2, \ldots$), stepsize $\gamma < \frac{1}{L}$.
Output: $x_T$.
1: Initialize $x_0 \in \mathbb{R}^N$.
2: for $k = 1, 2, \ldots, T$ do
3:   Compute $x_k \in \mathrm{Prox}^{\varepsilon_k}_{\gamma h}\big(x_{k-1} - \gamma\nabla g(x_{k-1})\big)$.
4: end for
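As a minimal illustration of Algorithm 1 (our own sketch, not the authors' Matlab code), the loop below only assumes a gradient oracle for $g$ and a routine `inexact_prox(v, gamma, eps)` that returns an $\varepsilon$-approximate solution of the proximal subproblem; the error schedule $\varepsilon_k = 1/k^2$ is just one illustrative decreasing, summable choice.

```python
import numpy as np

def ipg(x0, grad_g, inexact_prox, gamma, num_iters=100):
    """Basic inexact proximal gradient loop (sketch of Algorithm 1).
    inexact_prox(v, gamma, eps) should return an eps-approximate minimizer of
    1/(2*gamma) * ||x - v||^2 + h(x)."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, num_iters + 1):
        eps_k = 1.0 / k ** 2                       # decreasing, summable error schedule
        x = inexact_prox(x - gamma * grad_g(x), gamma, eps_k)
    return x

# toy usage: g(x) = 0.5 * ||x - b||^2 and h(x) = ||x||_1, where the exact
# soft-thresholding prox plays the role of the "inexact" oracle (eps ignored)
b = np.array([1.0, -2.0, 0.3])
sol = ipg(np.zeros(3), grad_g=lambda x: x - b,
          inexact_prox=lambda v, g, e: np.sign(v) * np.maximum(np.abs(v) - g, 0.0),
          gamma=0.5)
print(sol)
```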
Accelerated Versions

We first propose a Nesterov's accelerated inexact proximal gradient algorithm for non-convex optimization, and then give a nonmonotone accelerated inexact proximal gradient algorithm. As shown in Table 1, for convex optimization problems, Beck and Teboulle (2009) proposed the Nesterov's accelerated (exact) proximal gradient (APG) method, and Schmidt, Le Roux, and Bach (2011) proposed a Nesterov's accelerated inexact proximal gradient (IAPG) method. Both APG and IAPG are accelerated by a momentum term. However, as mentioned in Li and Lin (2015), the traditional Nesterov's acceleration may encounter a bad momentum term in non-convex optimization. To address the bad momentum term, Li and Lin (2015) added another proximal operator as a monitor to make the objective function sufficiently descend. To make the objective values generated by our AIPG strictly descend, we follow the framework of APG in Li and Lin (2015). Thus, we compute two inexact proximal operators, $z_{k+1} \in \mathrm{Prox}^{\varepsilon_k}_{\gamma h}(y_k - \gamma\nabla g(y_k))$ and $v_{k+1} \in \mathrm{Prox}^{\varepsilon_k}_{\gamma h}(x_k - \gamma\nabla g(x_k))$, where $v_{k+1}$ acts as a monitor that makes the objective function strictly descend. Specifically, our AIPG is presented in Algorithm 2.

Algorithm 2: Accelerated inexact proximal gradient method (AIPG)
Input: number of iterations $T$, errors $\varepsilon_k$ ($k = 1, 2, \ldots$), $t_0 = 0$, $t_1 = 1$, stepsize $\gamma < \frac{1}{L}$.
Output: $x_{T+1}$.
1: Initialize $x_0 \in \mathbb{R}^N$, and $x_1 = z_1 = x_0$.
2: for $k = 1, 2, \ldots, T$ do
3:   $y_k = x_k + \frac{t_{k-1}}{t_k}(z_k - x_k) + \frac{t_{k-1} - 1}{t_k}(x_k - x_{k-1})$.
4:   Compute $z_{k+1}$ such that $z_{k+1} \in \mathrm{Prox}^{\varepsilon_k}_{\gamma h}\big(y_k - \gamma\nabla g(y_k)\big)$.
5:   Compute $v_{k+1}$ such that $v_{k+1} \in \mathrm{Prox}^{\varepsilon_k}_{\gamma h}\big(x_k - \gamma\nabla g(x_k)\big)$.
6:   $t_{k+1} = \frac{\sqrt{4t_k^2 + 1} + 1}{2}$.
7:   $x_{k+1} = z_{k+1}$ if $f(z_{k+1}) \le f(v_{k+1})$, and $x_{k+1} = v_{k+1}$ otherwise.
8: end for

To address the bad momentum term in the non-convex setting, our AIPG (i.e., Algorithm 2) uses a pair of inexact proximal operators to make the objective values strictly descend. Thus, AIPG is a monotone algorithm.
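The momentum and monitor steps of Algorithm 2 can be sketched as follows (again our own illustration rather than the authors' implementation; `f`, `grad_g` and `inexact_prox` are assumed to be supplied by the caller, and the error schedule is a placeholder).

```python
import numpy as np

def aipg(x0, f, grad_g, inexact_prox, gamma, num_iters=100):
    """Accelerated inexact proximal gradient (sketch of Algorithm 2).
    Maintains a momentum extrapolation y_k and a monitor step v_{k+1} so that
    the objective values f(x_k) are non-increasing."""
    x_prev = x = z = np.asarray(x0, dtype=float)   # x_1 = z_1 = x_0
    t_prev, t = 0.0, 1.0                           # t_0 = 0, t_1 = 1
    for k in range(1, num_iters + 1):
        eps_k = 1.0 / k ** 2                       # illustrative error schedule
        y = x + (t_prev / t) * (z - x) + ((t_prev - 1.0) / t) * (x - x_prev)
        z = inexact_prox(y - gamma * grad_g(y), gamma, eps_k)   # accelerated step
        v = inexact_prox(x - gamma * grad_g(x), gamma, eps_k)   # monitor step
        x_prev, x = x, (z if f(z) <= f(v) else v)               # keep the better point
        t_prev, t = t, (np.sqrt(4.0 * t ** 2 + 1.0) + 1.0) / 2.0
    return x
```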

Actually, using two proximal operators is a conservative strategy. As mentioned in Li and Lin (2015), we can accept $z_{k+1}$ as $x_{k+1}$ directly if it satisfies the criterion $f(z_{k+1}) \le f(x_k) - \delta\|z_{k+1} - y_k\|^2$, which shows that $y_k$ is a good extrapolation. $v_{k+1}$ is computed only when this criterion is not met. Thus, the average number of proximal operator evaluations can be reduced. Following this idea, we propose our nonmonotone accelerated inexact proximal gradient algorithm (nmAIPG) in Algorithm 3. Empirically, we find that nmAIPG works well with $\delta \in [0.5, 1]$; in our experiments, we set $\delta = 0.6$.

Algorithm 3: Nonmonotone accelerated inexact proximal gradient method (nmAIPG)
Input: number of iterations $T$, errors $\varepsilon_k$ ($k = 1, 2, \ldots$), $t_0 = 0$, $t_1 = 1$, stepsize $\gamma < \frac{1}{L}$, $\delta > 0$.
Output: $x_{T+1}$.
1: Initialize $x_0 \in \mathbb{R}^N$, and $x_1 = z_1 = x_0$.
2: for $k = 1, 2, \ldots, T$ do
3:   $y_k = x_k + \frac{t_{k-1}}{t_k}(z_k - x_k) + \frac{t_{k-1} - 1}{t_k}(x_k - x_{k-1})$.
4:   Compute $z_{k+1}$ such that $z_{k+1} \in \mathrm{Prox}^{\varepsilon_k}_{\gamma h}\big(y_k - \gamma\nabla g(y_k)\big)$.
5:   if $f(z_{k+1}) \le f(x_k) - \delta\|z_{k+1} - y_k\|^2$ then
6:     $x_{k+1} = z_{k+1}$
7:   else
8:     Compute $v_{k+1}$ such that $v_{k+1} \in \mathrm{Prox}^{\varepsilon_k}_{\gamma h}\big(x_k - \gamma\nabla g(x_k)\big)$.
9:     $x_{k+1} = z_{k+1}$ if $f(z_{k+1}) \le f(v_{k+1})$, and $x_{k+1} = v_{k+1}$ otherwise.
10:  end if
11:  $t_{k+1} = \frac{\sqrt{4t_k^2 + 1} + 1}{2}$.
12: end for

Convergence Analysis

As mentioned before, the convergence analysis of inexact proximal gradient methods for non-convex problems is still an open problem; this section addresses the challenge. Specifically, we first prove that IPG and AIPG converge to a critical point in the convex or non-convex setting (Theorem 1) if $\{\varepsilon_k\}$ is a decreasing sequence and $\sum_{k=1}^{\infty}\varepsilon_k < \infty$. Next, we prove that IPG has the convergence rate $O(\frac{1}{T})$ for non-convex problems (Theorem 2) when the errors decrease at an appropriate rate. Then, we prove the convergence rates of AIPG in the non-convex setting (Theorem 3). The detailed proofs of Theorems 1, 2 and 3 can be found in the Appendix.

Convergence of IPG and AIPG

We first prove that IPG and AIPG converge to a critical point (Theorem 1) if $\{\varepsilon_k\}$ is a decreasing sequence and $\sum_{k=1}^{\infty}\varepsilon_k < \infty$.

Theorem 1. Under Assumption 1, if $\{\varepsilon_k\}$ is a decreasing sequence and $\sum_{k=1}^{\infty}\varepsilon_k < \infty$, then $0 \in \lim_{k\rightarrow\infty}\big(\nabla g(x_k) + \partial_{\varepsilon_k} h(x_k)\big)$ for IPG and AIPG, in both convex and non-convex optimization.

Remark 1. Theorem 1 guarantees that IPG and AIPG converge to a critical point (also called a stationary point) after an infinite number of iterations in the convex or non-convex setting.

Convergence Rates of IPG

Because both the functions $g(x)$ and $h(x)$ are possibly non-convex, we cannot directly use $f(x_k) - f(x^*)$ or $\|x_k - x^*\|^2$ to analyze the convergence rate of IPG, where $x^*$ is an optimal solution of (1). In this paper, we use $\sum_{k=1}^{T}\|x_k - x_{k-1}\|^2$ to analyze the convergence rate of IPG in the non-convex setting (a bounded sum implies $\min_{1\le k\le T}\|x_k - x_{k-1}\|^2 = O(\frac{1}{T})$); the detailed reason is provided in the Appendix. Theorem 2 shows that IPG has the convergence rate $O(\frac{1}{T})$ for non-convex optimization when the errors decrease at an appropriate rate, which is exactly the same as the error-free case (see the discussion in Remark 2).

Theorem 2. Suppose $g(x)$ is non-convex, and $h(x)$ is convex or non-convex. Then IPG satisfies the following results:

1. If $h(x)$ is convex, we have
$$\sum_{k=1}^{T}\|x_k - x_{k-1}\|^2 \le \left(\frac{A + \sqrt{A^2 + 2\left(\frac{1}{\gamma} - L\right)\big(f(x_0) - f(x^*) + B\big)}}{\frac{1}{\gamma} - L}\right)^{2} \qquad (7)$$
where $A = \sum_{k=1}^{T}\sqrt{\frac{2\varepsilon_k}{\gamma}}$ and $B = \sum_{k=1}^{T}\varepsilon_k$.

2. If $h(x)$ is non-convex, we have
$$\sum_{k=1}^{T}\|x_k - x_{k-1}\|^2 \le \frac{2}{\frac{1}{\gamma} - L}\left(f(x_0) - f(x^*) + \sum_{k=1}^{T}\varepsilon_k\right) \qquad (8)$$

Remark 2. Theorem 2 implies that IPG has the convergence rate $O(\frac{1}{T})$ for non-convex optimization without errors. If $\{\sqrt{\varepsilon_k}\}$ is summable and $h(x)$ is convex, IPG also has the convergence rate $O(\frac{1}{T})$ for non-convex optimization. If $\{\varepsilon_k\}$ is summable and $h(x)$ is non-convex, IPG also has the convergence rate $O(\frac{1}{T})$ for non-convex optimization.

Convergence Rates of AIPG

In this section, based on the $\varepsilon$-KL property, we prove that AIPG converges in a finite number of iterations, at a linear rate, or at a sublinear rate under different conditions in the non-convex setting (Theorem 3), which is exactly the same as the error-free case (Li and Lin 2015).

Theorem 3. Assume that $g$ is a non-convex function with Lipschitz continuous gradient, and $h$ is a proper and lower semicontinuous function. If the function $f$ satisfies the $\varepsilon$-KL property, $\varepsilon_k = \alpha\|v_{k+1} - x_k\|^2$ with $\alpha \ge 0$ and $\frac{1}{\gamma} - L - 2\alpha > 0$, and the desingularising function has the form $\varphi(t) = \frac{C}{\theta}t^{\theta}$ for some $C > 0$ and $\theta \in (0, 1]$, then, with $f^* = \lim_{k\rightarrow\infty} f(x_k)$:

1. If $\theta = 1$, there exists $k_1$ such that $f(x_k) = f^*$ for all $k > k_1$, and AIPG terminates in a finite number of steps.

2. If $\theta \in [\frac{1}{2}, 1)$, there exists $k_2$ such that for all $k > k_2$,
$$f(x_k) - f^* \le \left(\frac{d_1 C^2}{1 + d_1 C^2}\right)^{k - k_2}\big(f(v_{k_2}) - f^*\big)$$
where $d_1 = \left(\frac{1}{\gamma} + L + \frac{2\alpha}{\gamma}\right)^2 \Big/ \left(\frac{1}{\gamma} - L - 2\alpha\right)$.

3. If $\theta \in (0, \frac{1}{2})$, there exists $k_3$ such that for all $k > k_3$,
$$f(x_k) - f^* \le \left(\frac{C}{(k - k_3)\,d_2\,(1 - 2\theta)}\right)^{\frac{1}{1 - 2\theta}}$$
where $d_2 = \min\left\{\frac{1}{2d_1 C},\ \frac{C}{1 - 2\theta}\Big(2^{\frac{2\theta - 1}{2\theta - 2}} - 1\Big)\big(f(v_0) - f^*\big)^{2\theta - 1}\right\}$.

Remark 3. Theorem 3 implies that AIPG converges in a finite number of iterations when $\theta = 1$, at a linear rate when $\theta \in [\frac{1}{2}, 1)$, and at least at a sublinear rate when $\theta \in (0, \frac{1}{2})$.

Experiments

We first give the experimental setup, and then present the implementations of the three non-convex applications with the corresponding experimental results.

Experimental Setup

Design of Experiments. To validate the effectiveness of our inexact proximal gradient methods (i.e., IPG, AIPG and nmAIPG), we apply them to three representative non-convex learning problems as follows.

1. Robust OSCAR: Robust OSCAR is a robust version of the OSCAR method (Zhong and Kwok 2012), which is a feature selection model with the capability to automatically detect feature group structure. The function $g(x)$ is non-convex, and the function $h(x)$ is convex. Zhong and Kwok (2012) proposed a fast iterative group merging algorithm for exactly solving the proximal operator.

2. Social Link Prediction: Given an incomplete matrix $M$ (user-by-user) with each entry $M_{ij} \in \{+1, -1\}$, social link prediction is to recover the matrix $M$ (i.e., predict the potential social links or friendships between users) under a low-rank constraint. The function $g(x)$ is convex, and the function $h(x)$ is non-convex. The low-rank proximal operator can be solved exactly by the Lanczos method (Larsen 1998).

3. Robust Trace Lasso: Robust trace Lasso is a robust version of trace Lasso (Grave, Obozinski, and Bach 2011). The function $g(x)$ is non-convex, and the function $h(x)$ is convex. To the best of our knowledge, there is still no proximal gradient algorithm for solving trace Lasso.

We also summarize these three non-convex learning problems in Table 2. To show the advantages of our inexact proximal gradient methods, we compare the convergence and the running time of our inexact proximal gradient methods against the exact proximal gradient methods (i.e., PG, APG and nmAPG).

Table 2: Three typical learning applications. (C, NC and PO are the abbreviations of convex, non-convex and proximal operator, respectively.)

Application          g(x)   h(x)   Exact PO
Robust OSCAR         NC     C      Yes
Link Prediction      C      NC     Yes
Robust Trace Lasso   NC     C      No

Datasets. Table 3 summarizes the six datasets used in our experiments. Specifically, the Cardiac and Coil20 datasets are used for the robust OSCAR application.
The Cardiac dataset was collected from a hospital; the task is to predict the area of the left ventricle, and the model is encouraged to find homogeneous groups of features. The Soc-sign-epinions and Soc-Epinions1 datasets are social network data for the social link prediction application, where each row corresponds to a node and each column corresponds to an edge. The GasSensorArrayDrift and YearPredictionMSD datasets from the UCI repository are used for the robust trace Lasso application.

Implementations and Results

Robust OSCAR. For robust regression, we replace the square loss originally used in OSCAR with the correntropy-induced loss (He, Zheng, and Hu 2011). Thus, we consider robust OSCAR with the functions $g(x)$ and $h(x)$ as follows:
$$g(x) = \sigma^2\sum_{i=1}^{l}\left(1 - e^{-\frac{(y_i - X_i^{T}x)^2}{\sigma^2}}\right) \qquad (9)$$
$$h(x) = \lambda_1\|x\|_1 + \lambda_2\sum_{i<j}\max\{|x_i|, |x_j|\} \qquad (10)$$
where $\lambda_1 \ge 0$ and $\lambda_2 \ge 0$ are two regularization parameters.
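The correntropy-induced loss and its gradient, which every algorithm above needs for its gradient step, are cheap to evaluate. The following sketch (ours, in NumPy, with `sigma` a user-chosen bandwidth) is only meant to illustrate eq. (9); it is not the authors' implementation.

```python
import numpy as np

def correntropy_loss_grad(x, X, y, sigma=1.0):
    """Correntropy-induced loss g(x) = sigma^2 * sum_i (1 - exp(-r_i^2 / sigma^2)),
    with residuals r_i = y_i - X_i^T x, and its gradient. Smooth but non-convex in x."""
    r = y - X @ x                           # residuals
    w = np.exp(-r ** 2 / sigma ** 2)        # per-sample weights
    loss = sigma ** 2 * np.sum(1.0 - w)
    grad = -2.0 * X.T @ (w * r)             # chain rule through r = y - X x
    return loss, grad

# toy usage on random data (purely illustrative)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), rng.normal(size=5)
print(correntropy_loss_grad(np.zeros(3), X, y, sigma=0.5))
```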

Figure 1: Comparison of convergence speed for different methods. (a)-(d): Robust OSCAR (Cardiac and Coil20, objective vs. iteration and vs. time). (e)-(h): Link prediction (SSE and SE, objective vs. iteration and vs. time). (i)-(l): Robust trace Lasso (GS and YP, objective vs. iteration and vs. time).

Table 3: The datasets used in the experiments.

Application          Dataset                     Sample size   Attributes
Robust OSCAR         Cardiac                     3,
                     Coil20                      1,440         1,024
Social Link          Soc-sign-epinions (SSE)     131,828       841,372
Prediction           Soc-Epinions1 (SE)          75,879        508,837
Robust Trace Lasso   GasSensorArrayDrift (GS)    13,910        128
                     YearPredictionMSD (YP)      515,345       90

For exact proximal gradient algorithms, Zhong and Kwok (2012) proposed a fast iterative group merging algorithm for exactly solving the proximal subproblem. For our inexact algorithms, we propose a subgradient algorithm to approximately solve the proximal subproblem. Let $o(j) \in \{1, 2, \ldots, N\}$ denote the order of $|x_j|$ among $\{|x_1|, |x_2|, \ldots, |x_N|\}$ (here, $x_j$ denotes the $j$-th coordinate of the vector $x$), such that $o(j_1) < o(j_2)$ implies $|x_{j_1}| \le |x_{j_2}|$. Thus, a subgradient of $h(x)$ can be computed as $\partial h(x) = \sum_{j=1}^{N}\big(\lambda_1 + \lambda_2(o(j) - 1)\big)\,\partial|x_j|$. The full subgradient algorithm is omitted here due to the space limit.

We implement our IPG, AIPG and nmAIPG methods for robust OSCAR in Matlab, and we also implement the exact PG, APG and nmAPG in Matlab. Figures 1a and 1c show the objective value vs. iteration for the exact and inexact proximal gradient methods. The results confirm that the exact and inexact proximal gradient methods have the same convergence rates, and that the exact and inexact accelerated methods converge faster than the basic methods (PG and IPG). Figures 1b and 1d show the objective value vs. running time for the exact and inexact proximal gradient methods. The results show that our inexact methods are significantly faster than the exact methods; when the dimensionality increases, we can even achieve a more than 100-fold speedup. This is because our subgradient-based algorithm for approximately solving the proximal subproblem is much more efficient than the projection algorithm for exactly solving the proximal subproblem (Zhong and Kwok 2012).
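To illustrate the subgradient computation above, the following sketch (ours, in NumPy) evaluates one subgradient of the OSCAR penalty through the rank-based weights $\lambda_1 + \lambda_2(o(j) - 1)$; it is a schematic of the weight computation, not the authors' Matlab code.

```python
import numpy as np

def oscar_subgradient(x, lam1, lam2):
    """One subgradient of h(x) = lam1*||x||_1 + lam2*sum_{i<j} max(|x_i|, |x_j|).
    Writing h(x) = sum_j (lam1 + lam2*(o(j)-1)) * |x_j|, where o(j) is the rank of
    |x_j| in ascending order, a valid subgradient assigns each coordinate its
    rank-dependent weight times sign(x_j) (taking 0 as the subgradient of |.| at 0)."""
    ranks = np.argsort(np.argsort(np.abs(x)))   # o(j) - 1: 0 for the smallest |x_j|
    weights = lam1 + lam2 * ranks
    return weights * np.sign(x)

print(oscar_subgradient(np.array([0.5, -2.0, 0.0, 1.0]), lam1=0.1, lam2=0.05))
```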

Social Link Prediction. In the social link prediction problem, we hope to predict new potential social links or friendships between online users. Given an incomplete matrix $M$ (user-by-user) with each entry $M_{ij} \in \{+1, -1\}$, social link prediction is to recover the matrix $M$ under a low-rank constraint. Specifically, social link prediction considers the following problem:
$$\min_{X}\ \underbrace{\sum_{(i,j)\in\Omega}\log\big(1 + \exp(-X_{ij}M_{ij})\big)}_{g(X)} \quad \text{s.t.}\ \mathrm{rank}(X) \le r \qquad (11)$$
where $\Omega$ is the set of index pairs $(i, j)$ corresponding to the observed entries of $M$. The proximal operator $\min_{\mathrm{rank}(X)\le r}\|X - (X_t - \gamma\nabla g(X_t))\|_F^2$ can be solved by the rank-$r$ singular value decomposition (SVD) (Jain, Meka, and Dhillon 2010). The rank-$r$ SVD can be computed exactly by the Lanczos method (Larsen 1998), and can also be computed approximately by the power method (Halko, Martinsson, and Tropp 2011; Journée et al. 2010).

We implement our IPG, AIPG and nmAIPG methods for social link prediction in Matlab, and we also implement the exact PG, APG and nmAPG in Matlab. Figures 1e and 1g show the objective value vs. iteration for the exact and inexact proximal gradient methods. The results confirm that the exact and inexact proximal gradient methods have the same convergence rates. Figures 1f and 1h illustrate the objective value vs. running time for the exact and inexact proximal gradient methods. The results verify that our inexact methods are faster than the exact methods.

Robust Trace Lasso. Robust trace Lasso is a robust version of trace Lasso (Grave, Obozinski, and Bach 2011; Bach 2008). As with robust OSCAR, we replace the square loss originally used in trace Lasso with the correntropy-induced loss (He, Zheng, and Hu 2011) for robust regression. Thus, we consider robust trace Lasso with the functions $g(x)$ and $h(x)$ as follows:
$$g(x) = \sigma^2\sum_{i=1}^{l}\left(1 - e^{-\frac{(y_i - X_i^{T}x)^2}{\sigma^2}}\right) \qquad (12)$$
$$h(x) = \lambda\,\|X\mathrm{Diag}(x)\|_* \qquad (13)$$
where $\|\cdot\|_*$ is the trace norm and $\lambda$ is a regularization parameter. To the best of our knowledge, there is still no algorithm that exactly solves this proximal subproblem. To implement our IPG, AIPG and nmAIPG, we propose a subgradient algorithm to approximately solve the proximal subproblem. Specifically, the subgradient of the trace Lasso regularization $\|X\mathrm{Diag}(x)\|_*$ can be computed by Theorem 4, which was originally provided in Bach (2008). We implement our IPG, AIPG and nmAIPG methods for robust trace Lasso in Matlab.

Theorem 4. Let $U\mathrm{Diag}(s)V^T$ be the singular value decomposition of $X\mathrm{Diag}(x)$. Then the subdifferential of the trace Lasso regularization $\|X\mathrm{Diag}(x)\|_*$ is given by
$$\partial\|X\mathrm{Diag}(x)\|_* = \big\{\mathrm{Diag}\big(X^T(UV^T + M)\big) : \|M\|_2 \le 1,\ U^T M = 0 \text{ and } MV = 0\big\} \qquad (14)$$
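Taking $M = 0$ in Theorem 4 yields one concrete element of the subdifferential, which is all a subgradient step needs. The sketch below (ours, in NumPy) computes it from a thin SVD of $X\mathrm{Diag}(x)$, truncated to the nonzero singular values.

```python
import numpy as np

def trace_lasso_subgradient(X, x, tol=1e-10):
    """One subgradient of h(x) = ||X Diag(x)||_* (trace norm), obtained from
    Theorem 4 with the admissible choice M = 0: Diag(X^T U V^T)."""
    U, s, Vt = np.linalg.svd(X @ np.diag(x), full_matrices=False)
    r = int(np.sum(s > tol))                       # keep only the nonzero singular values
    return np.diag(X.T @ U[:, :r] @ Vt[:r, :])     # diagonal entries of X^T U V^T

# toy usage (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
print(trace_lasso_subgradient(X, np.array([1.0, -0.5, 0.0, 2.0])))
```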
Figures 1i-1l show the objective value vs. iteration and vs. running time, respectively, for our inexact proximal gradient methods. The results demonstrate that we provide efficient proximal gradient algorithms for robust trace Lasso. More importantly, our IPG, AIPG and nmAIPG algorithms fill the vacancy that there is no proximal gradient algorithm for trace Lasso. Directly solving the proximal subproblem for robust trace Lasso is quite difficult (Grave, Obozinski, and Bach 2011; Bach 2008); our subgradient-based algorithm provides an alternative approach to solving the proximal subproblem.

Summary of the Experimental Results. Based on the results of the three non-convex machine learning applications, our conclusion is that our inexact proximal gradient algorithms provide flexible solvers for optimization problems with complex non-smooth regularization. More importantly, our inexact algorithms can be significantly faster than the exact proximal gradient algorithms.

Conclusion

Existing inexact proximal gradient methods only consider convex problems, and knowledge of inexact proximal gradient methods in the non-convex setting is very limited. To address this problem, we proposed three inexact proximal gradient algorithms together with their theoretical analysis. The theoretical results show that our inexact proximal gradient algorithms can have the same convergence rates as the exact proximal gradient algorithms in the non-convex setting. We also provided the applications of our inexact proximal gradient algorithms on three representative non-convex learning problems. The results confirm the superiority of our inexact proximal gradient algorithms.

Acknowledgement

This work was partially supported by the following grants: NSF-IIS 30675, NSF-IIS 3445, NSF-DBI 35668, NSF-IIS 69308, NSF-IIS, NIH R0 AG.

References

Bach, F.; Jenatton, R.; Mairal, J.; and Obozinski, G. 2012. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4(1):1-106.
Bach, F. R. 2008. Consistency of trace norm minimization. Journal of Machine Learning Research 9(Jun).
Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183-202.
Bertsekas, D. P.; Nedić, A.; and Ozdaglar, A. E. 2003. Convex Analysis and Optimization. Athena Scientific.
Boţ, R. I.; Csetnek, E. R.; and László, S. C. 2016. An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions. EURO Journal on Computational Optimization 4(1).

Chapelle, O.; Chi, M.; and Zien, A. 2006. A continuation method for semi-supervised SVMs. In Proceedings of the 23rd International Conference on Machine Learning. ACM.
Duchi, J., and Singer, Y. 2009. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research 10(Dec).
Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; and Suykens, J. A. 2015. Learning with the maximum correntropy criterion induced losses for regression. Journal of Machine Learning Research 16.
Ghadimi, S., and Lan, G. 2016. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming 156(1-2).
Grave, E.; Obozinski, G. R.; and Bach, F. R. 2011. Trace Lasso: a trace norm regularization for correlated designs. In Advances in Neural Information Processing Systems.
Halko, N.; Martinsson, P.-G.; and Tropp, J. A. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53(2):217-288.
He, R.; Zheng, W.-S.; and Hu, B.-G. 2011. Maximum correntropy criterion for robust face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8).
Hsieh, C.-J., and Olsen, P. A. 2014. Nuclear norm minimization via active subspace selection. In ICML.
Jain, P.; Meka, R.; and Dhillon, I. S. 2010. Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems.
Journée, M.; Nesterov, Y.; Richtárik, P.; and Sepulchre, R. 2010. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research 11(Feb).
Larsen, R. M. 1998. Lanczos bidiagonalization with partial reorthogonalization. DAIMI Report Series 27(537).
Li, H., and Lin, Z. 2015. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems.
Lin, Q.; Lu, Z.; and Xiao, L. 2014. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems.
Schmidt, M.; Le Roux, N.; and Bach, F. 2011. Convergence rates of inexact proximal-gradient methods for convex optimization. arXiv preprint arXiv:1109.2415.
Tappenden, R.; Richtárik, P.; and Gondzio, J. 2016. Inexact coordinate descent: complexity and preconditioning. Journal of Optimization Theory and Applications.
Villa, S.; Salzo, S.; Baldassarre, L.; and Verri, A. 2013. Accelerated and inexact forward-backward algorithms. SIAM Journal on Optimization 23(3).
Wood, G., and Zhang, B. 1996. Estimation of the Lipschitz constant of a function. Journal of Global Optimization 8(1):91-103.
Xiao, L., and Zhang, T. 2014. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24(4).
Zhang, T. 2010. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research 11(Mar).
Zhong, L. W., and Kwok, J. T. 2012. Efficient sparse modeling with automatic feature grouping. IEEE Transactions on Neural Networks and Learning Systems 23(9).


Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China 6th International Conference on Machinery, Materials, Environent, Biotechnology and Coputer (MMEBC 06) Solving Multi-Sensor Multi-Target Assignent Proble Based on Copositive Cobat Efficiency and QPSO Algorith

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Warning System of Dangerous Chemical Gas in Factory Based on Wireless Sensor Network

Warning System of Dangerous Chemical Gas in Factory Based on Wireless Sensor Network 565 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 59, 07 Guest Editors: Zhuo Yang, Junie Ba, Jing Pan Copyright 07, AIDIC Servizi S.r.l. ISBN 978-88-95608-49-5; ISSN 83-96 The Italian Association

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels Extension of CSRSM for the Paraetric Study of the Face Stability of Pressurized Tunnels Guilhe Mollon 1, Daniel Dias 2, and Abdul-Haid Soubra 3, M.ASCE 1 LGCIE, INSA Lyon, Université de Lyon, Doaine scientifique

More information

Pattern Classification using Simplified Neural Networks with Pruning Algorithm

Pattern Classification using Simplified Neural Networks with Pruning Algorithm Pattern Classification using Siplified Neural Networks with Pruning Algorith S. M. Karuzzaan 1 Ahed Ryadh Hasan 2 Abstract: In recent years, any neural network odels have been proposed for pattern classification,

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3

More information

Composite optimization for robust blind deconvolution

Composite optimization for robust blind deconvolution Coposite optiization for robust blind deconvolution Vasileios Charisopoulos Daek Davis Mateo Díaz Ditriy Drusvyatskiy Abstract The blind deconvolution proble seeks to recover a pair of vectors fro a set

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

A Note on the Applied Use of MDL Approximations

A Note on the Applied Use of MDL Approximations A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention

More information

OPTIMIZATION in multi-agent networks has attracted

OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information