DIFFERENCE-OF-CONVEX LEARNING: DIRECTIONAL STATIONARITY, OPTIMALITY, AND SPARSITY


SIAM J. OPTIM. Vol. 27, No. 3 © 2017 Society for Industrial and Applied Mathematics

DIFFERENCE-OF-CONVEX LEARNING: DIRECTIONAL STATIONARITY, OPTIMALITY, AND SPARSITY

MIJU AHN, JONG-SHI PANG, AND JACK XIN

Abstract. This paper studies a fundamental bicriteria optimization problem for variable selection in statistical learning; the two criteria are a loss/residual function and a model control (also called regularization, penalty). The former function measures the fitness of the learning model to data, and the latter function is employed as a control of the complexity of the model. We focus on the case where the loss function is (strongly) convex and the model control function is a difference-of-convex (dc) sparsity measure. Our paper establishes some fundamental optimality and sparsity properties of directional stationary solutions to a nonconvex Lagrangian formulation of the bicriteria optimization problem, based on a specially structured dc representation of many well-known sparsity functions that can be profitably exploited in the analysis. We relate the Lagrangian optimization problem with the penalty constrained problem in terms of their respective d(irectional)-stationary solutions; this is in contrast to common analysis that pertains to the (global) minimizers of the problem, which are not computable due to nonconvexity. Most importantly, we provide sufficient conditions under which the d(irectional)-stationary solutions of the nonconvex Lagrangian formulation are global minimizers (possibly restricted due to nondifferentiability), thereby filling the gap between previous minimizer-based analysis and practical computational considerations. The established relation allows us to readily apply the derived results for the Lagrangian formulation to the penalty constrained formulation. Specializations of the conditions to exact and surrogate sparsity functions are discussed, yielding optimality and sparsity results for existing nonconvex formulations of the statistical learning problem.

Key words.
statistical learning, difference-of-convex programs, directional stationary points, optimality, sparsity

AMS subject classifications. 90C26, 65K10, 62B10

DOI. 10.1137/16M

1. Introduction. Sparse representation [7] is a fundamental methodology of data science in solving a broad range of problems from statistical and machine learning in artificial intelligence, to physical sciences and engineering (e.g., imaging and sensing technologies), and to medical decision making (e.g., classification of healthy versus unhealthy patients, benign and cancerous tumors). Significant advances have been made in the last decade on constructing intrinsically low-dimensional solutions in high-dimensional problems via convex programming. In statistical learning, the Least Absolute Shrinkage and Selection Operator (LASSO) [49] is an efficient linear optimization method in regression and variable selection problems based on the minimization of the $\ell_1$-norm $\|x\|_1 \triangleq \sum_{i=1}^n |x_i|$ of the $n$-dimensional model variable $x$, either subject to a certain prescribed residual constraint, leading to a constrained optimization problem, or employing such a residual as an additional criterion to be minimized, resulting in an unconstrained minimization problem. (Throughout this

Received by the editors July 14, 2016; accepted for publication (in revised form) May 1, 2017; published electronically August 2017. Funding: The first and second author's work was supported by the National Science Foundation under grants CMMI and IIS and by the Air Force Office of Scientific Research under grant FA. The third author's work was supported by the National Science Foundation grants DMS-1507, DMS-15383, and IIS. Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, CA (mijuahn@usc.edu, jongship@usc.edu). Department of Mathematics, University of California, Irvine, CA 92697 (jxin@math.uci.edu).

1637

paper, we address the case of vector optimization and leave the analysis of matrix optimization problems to future work.) Such a convex norm is employed as a surrogate of the $\ell_0$-function of $x$, i.e., $\|x\|_0 \triangleq \sum_{i=1}^n |x_i|_0$, where for a scalar $t$,

$|t|_0 \triangleq \begin{cases} 1 & \text{if } t \neq 0 \\ 0 & \text{if } t = 0, \end{cases}$

which counts the nonzero entries in the model variable $x$. Theoretical analysis of LASSO and its ability to recover the true support of the ground truth vector was pioneered by Candès and Tao [6, 7]. This operator has many good statistical properties such as model selection consistency [56], estimation consistency [3, 9], and the persistence property for prediction [6]. There exist many modified versions and algorithms proposed to improve the computational efficiency of LASSO, including adaptive LASSO [58], Bregman iterative algorithms [5], the Dantzig selector [10, 8], iteratively reweighted LASSO [9], the elastic net [59], and LARS (least angle regression) [15]. Until now, due to its favorable theoretical underpinnings and many efficient solution methods, convex optimization has been a principal venue for solving many statistical learning problems. Yet there is increasing evidence supporting the use of nonconvex formulations to enhance the realism of the models and improve their generalizations. For instance, in compressed sensing and image science [13], recent findings reported in [34, 51, 54] show that the difference of the $\ell_1$ and $\ell_2$ norms ($\ell_{1-2}$ for short) and a (nonconvex) transformed $\ell_1$ surrogate outperform $\ell_1$ and other known convex penalties in sparse signal recovery when the sensing matrix is highly coherent; such a regime occurs in superresolution imaging, where one attempts to recover fine scale structure from coarse scale data information; see [1, 5] and references therein.
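The three vector sparsity measures discussed so far can be sketched numerically; a minimal illustration (the test vector is hypothetical, chosen only to make the values easy to check by hand):

```python
import numpy as np

def l0(x):
    # exact sparsity measure: number of nonzero entries
    return int(np.count_nonzero(x))

def l1(x):
    # convex surrogate minimized by LASSO
    return float(np.sum(np.abs(x)))

def l1_minus_l2(x):
    # nonconvex DC surrogate ||x||_1 - ||x||_2
    return l1(x) - float(np.linalg.norm(x))

x = np.array([3.0, 0.0, -4.0, 0.0])
# l0(x) == 2, l1(x) == 7.0, and since ||x||_2 == 5.0, l1_minus_l2(x) == 2.0
```

Note that $\ell_{1-2}$ vanishes on the zero vector and on every 1-sparse vector, while remaining positive otherwise, which is the property exploited by the references above.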
More broadly, important applications in other areas such as computational statistics, machine learning, bio-informatics, and portfolio selection [8] offer promising sources of problems for the employment of nonconvex functionals to express model loss, promote sparsity, and enhance robustness. The present paper is motivated by the recent flurry of activities pertaining to the use of nonconvex functionals for statistical learning problems. This idea dates back more than fifteen years in statistics, when Fan and Li [17] pointed out that LASSO is a biased estimator and postulated several desirable statistics-based properties for a good univariate surrogate function for $\ell_0$ to have; the authors then introduced a univariate, butterfly shaped, folded concave [19], piecewise quadratic function called SCAD, for smoothly clipped absolute deviation, that is symmetric with respect to the origin and concave on $\mathbb{R}_+$; see also [18]. Besides SCAD and $\ell_{1-2}$, there exist today many penalized regression methods using nonconvex penalties, including the $\ell_q$ (for $q \in (0,1)$) penalty [20, 3], the related $\ell_p - \ell_q$ penalty for $p \geq 1$ and $q \in [0,1)$ [11], the combination of $\ell_0$ and $\ell_1$ penalties [3], the smooth integration of counting and absolute deviation (SICA) [35], the minimax concave penalty (MCP) [53], the capped-$\ell_1$ penalty [55], the logarithmic penalty [37], the truncated $\ell_1$ penalty [48], the hard thresholding penalty [57], the $\ell_{1-2}$ penalty [51], and the transformed $\ell_1$ [40, 35, 54] mentioned above. While there are algorithms for solving several of these nonconvex optimization problems, such as [17, 19, 53, 37, 4], due to the nondifferentiability of the minimands in these nonconvex optimization problems, it is not clear what kind of stationary points these algorithms actually compute. Indeed, without a clear understanding of such points, it is not possible to rigorously ascertain the stationarity, let alone minimizing, properties of the limit points of the iterates produced by the algorithms.
This will also help to close the gap between a minimizer-based statistical analysis and a practical computational procedure, relaxing the requirement of optimality to a more realistic stationarity property of the computational model in such an analysis.
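Several of the univariate penalties listed above admit simple closed forms. The sketch below uses the standard definitions from the cited literature (the parameter values $\lambda$ and $a$ are illustrative defaults, not taken from this paper):

```python
def scad(t, lam=1.0, a=3.7):
    # SCAD: linear near zero, quadratic transition, constant tail (a > 2)
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    return lam ** 2 * (a + 1) / 2

def mcp(t, lam=1.0, a=2.0):
    # minimax concave penalty: concave quadratic, then constant
    t = abs(t)
    if t <= a * lam:
        return lam * t - t ** 2 / (2 * a)
    return a * lam ** 2 / 2

def capped_l1(t, lam=1.0, a=1.0):
    # capped-l1: the l1 penalty truncated at level a
    return lam * min(abs(t), a)

def transformed_l1(t, a=1.0):
    # transformed l1: a bounded rational surrogate with value 1 at |t| = 1
    t = abs(t)
    return (a + 1) * t / (a + t)
```

All four functions are symmetric, zero only at the origin, and concave on $\mathbb{R}_+$, which is exactly the folded-concave shape postulated by Fan and Li; each one is also a dc function, a fact the unified formulation of section 2 exploits.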

Our work focuses on two types of sparsity measures: one type consists of those mentioned above that are surrogates of the $\ell_0$-function; the other class of functions is motivated by the $\ell_{1-2}$-function, which has the interesting property that its zeros coincide with the 1-sparse vectors; i.e., $\|x\|_1 - \|x\|_2 = 0$ if and only if $x$ has at most one nonzero component. Generalizing this function, we examine a function whose zeros are the K-sparse vectors for a given positive integer K; these are vectors with at most K nonzero components. While this function has been used to describe the rank of matrices [1, 5, 39], its use to express the sparsity of vectors has not received much attention in the literature. One exception is the reference [5], which has recognized the dc property [44, 45] of the K-sparsity function and employed it to describe cardinality constraints. All these functions are nonconvex and nondifferentiable, making the resulting optimization problems challenging to analyze and solve. A major goal of our study is to address these challenges, providing analysis, algorithms, and numerics to demonstrate the properties and benefits of such nonconvex sparsity functions. The overarching contention is that for the analysis of the resulting optimization problems, it is essential that we focus on solutions that are computable in practice, rather than on solutions that are not provably computable (such as global minimizers that cannot be guaranteed to be the outcome of an algorithm). Thus, our contribution is to provide the theoretical underpinning of what may be termed computable statistical learning, relying on the recent work [4], where numerical algorithms supported by convergence theories have been introduced for computing d-stationary solutions of general nonsmooth dc programs; see also [43]. An accompanying study [1] will report the details of computational experience with such algorithms applied to the problems described in this paper.
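One common realization of such a K-sparsity function (an assumption here, not a transcription from this paper) is $\|x\|_1$ minus the sum of the K largest magnitudes of $x$; both summands are convex, the difference is nonnegative, and it vanishes exactly on the K-sparse vectors:

```python
import numpy as np

def largest_K_sum(x, K):
    # sum of the K largest absolute entries of x (a convex function of x)
    return float(np.sort(np.abs(x))[::-1][:K].sum())

def K_sparsity(x, K):
    # DC function whose zeros are exactly the K-sparse vectors:
    # ||x||_1 minus the sum of the K largest magnitudes
    return float(np.abs(x).sum()) - largest_K_sum(x, K)

x = np.array([5.0, -2.0, 0.0, 1.0])   # a 3-sparse test vector
# K_sparsity(x, 3) == 0.0 ; K_sparsity(x, 2) == 1.0 ; K_sparsity(x, 1) == 3.0
```

For K = 1 this reduces to $\|x\|_1 - \|x\|_\infty$, which, like $\ell_{1-2}$, vanishes precisely on the 1-sparse vectors.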
With the above background and review of the literature, we are ready to summarize the multifold contributions of our work, whose setting is on vector problems. The starting point of these contributions is the introduction of a unified formulation with a piecewise-smooth dc objective function for all the sparsity measures mentioned above. Since the concave summands of several objective functions are of the piecewise smooth, thus nondifferentiable, kind, care must be exerted in understanding the stationary solutions of the optimization problems. For this purpose, we will build on the theory in the recent paper [4], which has identified directional stationarity as the least restrictive concept for nonsmooth dc programs, with the resulting d-stationary points being computable by practical algorithms. Most importantly, a major contribution of our work is the derivation of several theoretical results on the nonconvex Lagrangian formulation of the bicriteria statistical learning problem, giving conditions under which d-stationary solutions of the Lagrangian optimization problem are its global minima (perhaps restricted due to nondifferentiability), and in some cases, offering error bounds on the deviation of these solutions from an arbitrary vector, including the ground truth to be discovered. Unlike much analysis in the statistics literature, our approach is more in line with computational optimization by focusing on the optimization model that is employed as the principal computational vehicle for discovery. In this vein, our analysis is related to that in [11], which has studied the choice of the weighing factor and derived properties of a global minimizer of the particular $\ell_p$ problem; in contrast, our general results in section 3 are applicable to an arbitrary (strongly) convex loss function and a dc penalty function that covers many known sparsity functions in the literature. The analysis in the present paper is related to that in two recent articles [31, 33]. There are several differences, however.
Perhaps the main one is the respective foci of the cited papers and ours. In [31], where there are results connecting the sparsity

minimization problems formulated using the original $\ell_0$ objective and those using its surrogates, the emphasis therein is on the application of the dc algorithm to the latter surrogate formulations, linking them to existing machine learning schemes. In [33], the analysis is more from a statistical perspective, while ours is more from a practical optimization perspective. Specifically, the goal in [33] is to gain a deep understanding of how the solution to a regularized optimization problem can recover a true minimizer of the expected loss function (the so-called ground truth), where the former problem is constructed by minimizing the sum of the loss function plus a regularization term and with the constraint that the support of the variables is contained in that of the ground truth. Under extensive assumptions on the parameter choices, the analysis therein is quite detailed and covers many loss and regularization functions known in the literature. Nevertheless, besides the support constraint imposed in the optimization problem, two restrictions were required of the regularizer, i.e., the penalty function: separability and differentiability (except at the origin); these restrictions rule out several important classes of sparsity functions, such as the exact K-sparse functions (see section 5) and the family of piecewise smooth functions such as the capped-$\ell_1$ function and the piecewise linear approximation of the $\ell_2$-function in the $\ell_{1-2}$-function. The exact K-sparsity function is nonseparable; both classes of functions have nontrivial manifolds of nondifferentiable points that are potentially the limit points of iterates computed by an optimization procedure. In contrast, our theory is applicable to all these functions. More importantly, we aim to address the following issue, which from a computational perspective is of more practical significance.
Namely, given a loss function and a regularizing/penalty function, can one identify suitable weights combining these two criteria so that a d-stationary solution of the resulting nonconvex, nonsmooth optimization problem has some minimizing properties? Once this is done, the latter property then allows further analysis of the solution that includes error bounds from the ground truth in terms of some known statistical estimates. Our analysis is supported by numerical algorithms that can compute such a stationary solution, based on the recent work of [4]. Throughout, the dc property of the functions involved is the backbone of our results. Although this property is not explicitly stated in [33], one of the assumptions imposed therein on the univariate regularizer implies that it is a dc function. In summary, while there exists previous analysis of the bicriteria optimization problem with sparsity functions, our work contributes to a deeper understanding of the problem, particularly for nondifferentiable sparsity functions. The principal goal of our analysis is to address the optimality and sparsity properties of a directional stationary solution of the problem that can be computed in practice using a dc-based optimization algorithm [4]. Another noteworthy point is that we attempt to deal with an analysis sufficiently broad to be applicable to a host of loss and penalty functions, rather than focusing on specialized problems whose generalizations to other formulations may not be immediate.

2. The unified DC program of statistical learning. We consider the following Lagrangian form of the bicriteria statistical learning optimization problem for selecting the model variable $w \in \mathbb{R}^m$:

(1) $\underset{w \in W}{\text{minimize}} \; Z_\gamma(w) \triangleq \ell(w) + \gamma\, P(w),$

where $\gamma$ is a positive scalar parameter balancing the loss (or residual) function $\ell(w)$ and the model control (also called penalty) function $P(w)$, and $W$ is a closed convex (typically polyhedral) set in $\mathbb{R}^m$ constraining the model variable $w$. The unconstrained

case where $W = \mathbb{R}^m$ is of significant interest in many applications. Throughout the paper, we assume that

(A$_0$) $\ell(w)$ is a convex function on the closed convex set $W$, and $P(w)$ is a difference-of-convex (dc) function given by

(2) $P(w) \triangleq g(w) - h(w)$, with $h(w) \triangleq \max_{1 \leq i \leq I} h_i(w)$ for some integer $I > 0$,

where $g$ is convex but not necessarily differentiable and each $h_i$ is convex and differentiable. Thus the concave summand of $P(w)$, i.e., $-h(w)$, is the negative of a pointwise maximum of finitely many convex differentiable functions; as such, $h(w)$ is piecewise differentiable (PC$^1$) according to the definition of such a piecewise smooth function [16, Definition 4.5.1]. (Specifically, a continuous function $f(w)$ is PC$^1$ on an open domain $\Omega$ if there exist finitely many C$^1$ functions $\{f_i\}_{i=1}^M$ for some positive integer $M$ such that $f(w) \in \{f_i(w)\}_{i=1}^M$ for all $w \in \Omega$. If each $f_i$ is an affine function, we say that $f$ is piecewise affine on $\Omega$.) In many (inexact) surrogate sparsity functions, the function $P(w)$ is separable in its arguments; i.e., $P(w) = \sum_{i=1}^m p_i(w_i)$, where each $p_i(w_i)$ is a univariate dc function whose concave summand is the negative of a pointwise maximum of finitely many convex differentiable (univariate) functions, i.e., the univariate analogue of (2). It is important to note that while it has been recognized in the literature (e.g., [2, 4, 31, 50]) that several classes of sparsity functions can be formulated as dc functions, the particular form (2) of the function $h(w)$ has not been identified in the general dc approach to sparsity optimization. Our work herein exploits this piecewise structure profitably. Since every convex function can be represented as the pointwise maximum of a family of affine functions, possibly a continuum of them, it is natural to ask if the results in our work can be extended to a general convex function $h$.
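The structure (2) can be made concrete with the capped-$\ell_1$ penalty: $\lambda \min(|t|, a) = \lambda |t| - \max\{\lambda(t-a),\, \lambda(-t-a),\, 0\}$, i.e., a nonsmooth convex $g$ minus a pointwise maximum of $I = 3$ affine (hence differentiable) pieces. A minimal sketch with illustrative parameter values (not from the paper):

```python
lam, a = 1.0, 0.5   # hypothetical penalty parameters

# capped-l1 in the DC form (2): g convex nonsmooth, h = max of affine pieces
g = lambda t: lam * abs(t)
hs = [lambda t: lam * (t - a),     # h_1
      lambda t: lam * (-t - a),    # h_2
      lambda t: 0.0]               # h_3

def h(t):
    return max(hi(t) for hi in hs)

def P(t):
    # P(t) == lam * min(|t|, a)
    return g(t) - h(t)

def active_set(t, tol=1e-12):
    # the index set A(t) of pieces attaining the max, as used below in (4)
    ht = h(t)
    return [i for i, hi in enumerate(hs) if abs(hi(t) - ht) <= tol]

# P(0.2) == 0.2 and P(1.0) == 0.5; at the kink t == a, pieces 0 and 2 are active
```

The active set is a singleton wherever the penalty is smooth and has several elements exactly at the kinks, which is where the distinction between critical and d-stationary points (discussed in section 2.2) arises.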
A full treatment of this question is regrettably beyond the scope of this paper, which is aimed at addressing problems in sparsity optimization and not general dc programs.

2.1. A model-control constrained formulation. With two criteria, the loss $\ell(w)$ and model control $P(w)$, there are two alternative optimization formulations for the choice of the model variable $w$; one of these constrains the latter function while minimizing the former; this is in contrast to balancing the two criteria using a weighing parameter $\gamma$ in a single combined objective function to be minimized. Specifically, given a prescribed scalar $\beta > 0$, consider

(3) $\underset{w \in W}{\text{minimize}} \; \ell(w)$ subject to $P(w) \leq \beta$,

which we call the model control-constrained version of (1). Since both (1) and (3) are nonconvex problems and involve nondifferentiable functions, the connections between them are not immediately clear. For one thing, from a computational point of view, it is not practical to relate the two problems via their global minima; the reason is simple: such minima are computationally intractable. Instead, one needs to relate these problems via their respective computable solutions. Toward this end, it seems reasonable to relate the d(irectional)-stationary solutions of (1) to the B(ouligand)-stationary solutions of (3), as both kinds of solutions can be computed by the majorization/linearization algorithms described in [4] and they are the sharpest kind of stationary solutions for directionally differentiable optimization problems. Before presenting the details of this theory relating the two problems (1) and (3), we note

that a third formulation can be introduced wherein one minimizes the penalty function $P(w)$ while constraining the loss function $\ell(w)$ not to exceed a prescribed tolerance. As the loss function $\ell(w)$ is assumed to be convex in our treatment, the latter formulation is a convex constrained dc program; this is unlike (3), which is a dc constrained dc program, a problem that is considerably more challenging than a convex constrained program. Thus we will omit this loss function constrained formulation and focus only on relating the Lagrangian formulation (1) and the penalty constrained formulation (3).

2.2. Stationary solutions. With $W$ being a closed convex set, a vector $w^* \in W$ is formally a d-stationary solution of (1) if the directional derivative

$Z_\gamma'(w^*; w - w^*) \triangleq \lim_{\tau \downarrow 0} \frac{Z_\gamma(w^* + \tau(w - w^*)) - Z_\gamma(w^*)}{\tau} \geq 0 \quad \forall\, w \in W.$

Letting $\widehat{W}_\beta \triangleq \{ w \in W \mid P(w) \leq \beta \}$ be the (nonconvex) feasible set of (3), we say that a vector $\bar{w} \in \widehat{W}_\beta$ is a B-stationary solution of this problem if $\ell'(\bar{w}; dw) \geq 0$ for all $dw \in T(\widehat{W}_\beta; \bar{w})$, where $T(\widehat{W}_\beta; \bar{w})$ is the tangent cone of $\widehat{W}_\beta$ at $\bar{w}$; the latter cone consists of all vectors $dw$ for which there exist a sequence of vectors $\{w^k\} \subset \widehat{W}_\beta$ converging to $\bar{w}$ and a sequence of positive scalars $\{\tau_k\}$ converging to zero such that $dw = \lim_{k \to \infty} \frac{w^k - \bar{w}}{\tau_k}$. This paper focuses on the derivation of optimality and sparsity properties of directional stationary solutions of the problem (1). Through the connections with the constrained formulation (3) established below, the obtained results for (1) can be adopted to the former problem. A natural question is why we choose to focus on directional stationary solutions rather than stationary solutions of other kinds, such as a critical point for dc programs [4, 31, 44, 45]. For the reasons given in [4, section 3.3] (see, in particular, the summary of implications therein), directional stationary solutions are the sharpest kind among stationary solutions, in the sense that a directional stationary solution must be stationary according to the other definitions of stationarity.
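The gap between criticality and d-stationarity can be seen on a hypothetical one-dimensional dc function not taken from this paper, $Z(t) = t^2 - |t|$ with $g(t) = t^2$ and $h(t) = |t| = \max(t, -t)$: the origin is a critical point (zero lies in the difference of the subdifferentials), yet $Z'(0; d) = -|d| < 0$ for every direction $d \neq 0$, so it is not d-stationary; the d-stationary points $t = \pm 1/2$ are the global minimizers:

```python
def Z(t):
    # simple DC function: g(t) = t^2 (smooth convex) minus h(t) = |t|
    return t * t - abs(t)

def dir_deriv(t, d, tau=1e-8):
    # one-sided finite-difference approximation of Z'(t; d)
    return (Z(t + tau * d) - Z(t)) / tau

# at t = 0 both one-sided directional derivatives are about -1 (descent
# in every direction), so t = 0 is critical but not d-stationary;
# at t = 0.5 both are about 0, and Z(0.5) == -0.25 is the global minimum
```

This is precisely the "active pieces" phenomenon: at $t = 0$ both pieces $h_1(t) = t$ and $h_2(t) = -t$ are active, and a critical point certifies descent against only one of them.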
Moreover, as shown in Proposition 2.1 below and Proposition 3.1 later, d-stationary solutions possess minimizing properties that are not in general satisfied by stationary solutions of other kinds. In essence, the reason for the sharpness of d-stationarity is that it captures all the active pieces, as described by the index set $A(w^*)$ at the point under consideration $w^*$; see (4). In contrast, a critical point of a dc program fails to capture all the active pieces. Thus any property that a critical point might have may not be applicable to all the active pieces. Our first result is a characterization of a d-stationary solution of (1).

Proposition 2.1. Under assumption (A$_0$), a vector $w^* \in W$ is a d-stationary solution of (1) if and only if

(4) $w^* \in \underset{w \in W}{\arg\min}\; \underbrace{\ell(w) + \gamma \left\{ g(w) - \nabla h_i(w^*)^T (w - w^*) \right\}}_{\text{convex function in } w} \quad \forall\, i \in A(w^*),$

where $A(w^*) \triangleq \{ i \mid h_i(w^*) = h(w^*) \}$. Moreover, if the function $h$ is piecewise affine, then any such d-stationary solution must be a local minimizer of $Z_\gamma$ on $W$.

Proof. It suffices to note that $h'(w^*; dw) = \max_{i \in A(w^*)} \nabla h_i(w^*)^T dw$ for all vectors $dw \in \mathbb{R}^m$. The last assertion about the local minimizing property follows readily

from the fact [16, Exercise ] that we have $h(w) = h(w^*) + h'(w^*; w - w^*)$ for any piecewise affine function $h$, provided that $w$ is sufficiently near $w^*$.

Remark. The last assertion in the above proposition generalizes a result in [30], where an additional differentiability assumption was imposed.

The characterization of a B-stationary point of (3) is less straightforward, the reason being the nonconvexity of the feasible set $\widehat{W}_\beta$, which makes it difficult to unveil the structure of the tangent cone $T(\widehat{W}_\beta; \bar{w})$. If $P(\bar{w}) < \beta$, then $T(\widehat{W}_\beta; \bar{w}) = T(W; \bar{w})$. The following result shows that this case is not particularly interesting when it pertains to the stationarity analysis of $\bar{w}$.

Proposition 2.2. Under assumption (A$_0$), if $\bar{w} \in \widehat{W}_\beta$ is a B-stationary point of (3) such that $P(\bar{w}) < \beta$, then $\bar{w}$ is a global minimizer of (3).

Proof. As noted above, the fact that $P(\bar{w}) < \beta$ implies $T(\widehat{W}_\beta; \bar{w}) = T(W; \bar{w})$; hence $\bar{w}$ satisfies $\ell'(\bar{w}; dw) \geq 0$ for all $dw \in T(W; \bar{w})$. Since $\ell$ is a convex function and $W$ is a convex set, it follows that $\bar{w}$ is a global minimizer of $\ell$ on $W$. In turn, this implies that $\bar{w}$ is a global minimizer of $\ell$ on $\widehat{W}_\beta$ because $\widehat{W}_\beta$ is a subset of $W$.

To understand the cone $T(\widehat{W}_\beta; \bar{w})$ when $P(\bar{w}) = \beta$, we need certain constraint qualifications (CQs), under which it becomes possible to obtain a necessary and sufficient condition for a B-stationary point of (3) similar to that in Proposition 2.1 for a d-stationary point of (1). We list two such CQs as follows, one of which is a pointwise condition while the other pertains to the entire set $\widehat{W}_\beta$. The pointwise Slater CQ is said to hold at $\bar{w} \in \widehat{W}_\beta$ satisfying $P(\bar{w}) = \beta$ if $d \in T(W; \bar{w})$ exists such that $g'(\bar{w}; d) < \nabla h_j(\bar{w})^T d$ for all $j \in A(\bar{w})$. The piecewise affine CQ (PACQ) is said to hold for $\widehat{W}_\beta$ if $W$ is a polyhedron and the function $g$ is piecewise affine. (This CQ implies neither the convexity nor the piecewise polyhedrality of $\widehat{W}_\beta$ because no linearity is assumed for the functions $h_i$.) Under these CQs, we have the following characterization of a B-stationary point of the penalty constrained problem (3).
See [4] for a proof.

Proposition 2.3. Under assumption (A$_0$), if either the pointwise Slater CQ holds at $\bar{w}$ or the PACQ holds for the set $\widehat{W}_\beta$, then $\bar{w}$ is a B-stationary point of (3) if and only if $\ell'(\bar{w}; dw) \geq 0$ for all $dw \in \widehat{C}_j(\bar{w}) \triangleq \{ d \in T(W; \bar{w}) \mid g'(\bar{w}; d) \leq \nabla h_j(\bar{w})^T d \}$ and every $j \in A(\bar{w})$.

With the above two propositions, we can formally relate the two problems (1) and (3) as follows.

Proposition 2.4. Under assumption (A$_0$), the following two statements hold.

(a) If $w^*$ is a d-stationary solution of (1) for some $\gamma \geq 0$, and if the pointwise Slater CQ holds at $w^*$ for the set $\widehat{W}_\beta$ where $\beta \triangleq P(w^*)$, then $w^*$ is a B-stationary solution of (3).

(b) If $\bar{w}$ is a B-stationary solution of (3) for some $\beta > 0$, and if the pointwise Slater CQ holds at $\bar{w}$ for the set $\widehat{W}_\beta$, then the following two statements hold: (i) for each $j \in A(\bar{w})$, a scalar $\gamma_j \geq 0$ exists such that $\bar{w}$ is a d-stationary solution of

(5) $\underset{w \in W}{\text{minimize}}\; \left[\, \ell(w) + \gamma_j \left( g(w) - h_j(w) \right) \right];$

(ii) nonnegative scalars $\bar{\gamma}$ and $\{\tilde{\gamma}_j\}_{j=1}^I$ exist with $\sum_{j=1}^I \tilde{\gamma}_j = 1$ and $\tilde{\gamma}_j \left( h(\bar{w}) - h_j(\bar{w}) \right) =$

$0$ for all $j$, such that $\bar{w}$ is a d-stationary solution of

$\underset{w \in W}{\text{minimize}}\; \ell(w) + \bar{\gamma} \left( g(w) - \sum_{j=1}^I \tilde{\gamma}_j h_j(w) \right).$

Finally, statements (a) and (b) remain valid if the PACQ holds instead of the pointwise Slater CQ.

Proof. If $w^*$ is a d-stationary solution of (1), then

$\ell'(w^*; w - w^*) + \gamma \left[ g'(w^*; w - w^*) - \nabla h_i(w^*)^T (w - w^*) \right] \geq 0$

for all $w \in W$ and all $i \in A(w^*)$. Let $dw \in \widehat{C}_j(w^*)$ for some $j \in A(w^*)$. We have $dw = \lim_{k \to \infty} \frac{w^k - w^*}{\tau_k}$ for some sequence $\{w^k\} \subset W$ converging to $w^*$ and a sequence $\{\tau_k\} \subset \mathbb{R}_{++}$ converging to zero. For each $k$, we have

$\ell'(w^*; w^k - w^*) + \gamma \left[ g'(w^*; w^k - w^*) - \nabla h_j(w^*)^T (w^k - w^*) \right] \geq 0.$

Dividing by $\tau_k$ and letting $k \to \infty$, we deduce

$\ell'(w^*; dw) + \gamma \left[ g'(w^*; dw) - \nabla h_j(w^*)^T dw \right] \geq 0,$

which implies $\ell'(w^*; dw) \geq 0$ because the term in the square bracket is nonpositive. This proves parts (a) and (c).

To prove (b), let $\bar{w}$ be as given. By Proposition 2.3, it follows that for every $j \in A(\bar{w})$, $\ell'(\bar{w}; dw) \geq 0$ for all $dw \in \widehat{C}_j(\bar{w}) \triangleq \{ d \in T(W; \bar{w}) \mid g'(\bar{w}; d) \leq \nabla h_j(\bar{w})^T d \}$. Thus $dw = 0$ is a minimizer of $\ell'(\bar{w}; dw)$ for $dw \in \widehat{C}_j(\bar{w})$. Since the latter set is convex and satisfies the Slater CQ, a standard result in nonlinear programming duality theory (see, e.g., [2]) yields the existence of a scalar $\gamma_j \geq 0$ such that $dw = 0$ is a minimizer of $\ell'(\bar{w}; dw) + \gamma_j \left[ g'(\bar{w}; dw) - \nabla h_j(\bar{w})^T dw \right]$ on $T(W; \bar{w})$. This proves statement (i) in (b). Adding up the inequalities

$\ell'(\bar{w}; dw) + \gamma_j \left[ g'(\bar{w}; dw) - \nabla h_j(\bar{w})^T dw \right] \geq 0, \quad dw \in T(W; \bar{w}),$

for $j \in A(\bar{w})$, we deduce

$\ell'(\bar{w}; dw) + \bar{\gamma} \left[ g'(\bar{w}; dw) - \sum_{j \in A(\bar{w})} \tilde{\gamma}_j\, \nabla h_j(\bar{w})^T dw \right] \geq 0, \quad dw \in T(W; \bar{w}),$

where $\bar{\gamma} \triangleq \frac{1}{|A(\bar{w})|} \sum_{j \in A(\bar{w})} \gamma_j$ and $\tilde{\gamma}_j \triangleq \frac{\gamma_j}{\sum_{j' \in A(\bar{w})} \gamma_{j'}}$. This establishes (b).

Part (b) of Proposition 2.4 falls short of establishing the converse of part (a), i.e., the full equivalence of the two families of problems $\{(1)\}_{\gamma \geq 0}$ and $\{(3)\}_{\beta \geq 0}$ in terms of their d-stationary solutions and B-stationary solutions, respectively. This adds to the known challenges of the latter dc-constrained family of dc problems over the former convex constrained dc programs.
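The averaging step at the end of the proof of Proposition 2.4(b) is elementary arithmetic and can be checked on hypothetical multipliers (the values below are made up for illustration; pieces not in the active set receive weight zero):

```python
# hypothetical multipliers gamma_j for the active pieces A(w_bar) = {0, 2}
gammas = {0: 2.0, 2: 1.0}

# gamma_bar = (1/|A|) * sum of gamma_j over active j
gamma_bar = sum(gammas.values()) / len(gammas)

# normalized weights gamma_tilde_j = gamma_j / sum(gamma_j), summing to 1
total = sum(gammas.values())
weights = {j: gj / total for j, gj in gammas.items()}

# gamma_bar == 1.5 and weights == {0: 2/3, 2: 1/3}; note that
# gamma_bar * weights[j] == gammas[j] / len(gammas), which is exactly the
# coefficient produced by summing the |A| inequalities and dividing by |A|
```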
In the next section, we focus on the Lagrangian formulation (1) only and rely on the connections established in this section to obtain parallel results for the constrained formulation (3). Proposition 2.4 complements the penalty results in dc programming initially studied by [46] and recently expanded in [5, 4]. These previous penalty results address the question of when a fixed constrained problem (3) has an exact penalty formulation

as the problem (1) for sufficiently large $\gamma$, in terms of their global minima. Furthermore, in the case of a quadratic loss function, the reference [5] derives a lower bound on the scalar $\gamma$ for the penalty formulation to be exact. In contrast, allowing for a general convex loss function, Proposition 2.4 deals with stationary solutions, which from a computational perspective are more reasonable, as results pertaining to global minima lack pragmatism and should at best be regarded as providing only theoretical insights and conceptual evidence about the connection between the two formulations. To close this subsection, we give a result pertaining to the case where the function $h$ is piecewise affine; in this case, by Proposition 2.1, every d-stationary solution of (1) must be a local minimizer. Combining this fact with Proposition 2.4(b), we can establish a similar result for the problem (3).

Proposition 2.5. Let assumption (A$_0$) hold with each $h_i$ additionally being affine. If $\bar{w}$ is a B-stationary solution of (3) for some $\beta > 0$ such that $P(\bar{w}) = \beta$, and if either the pointwise Slater CQ holds at $\bar{w}$ for the set $\widehat{W}_\beta$ or the PACQ holds for the same set, then $\bar{w}$ is a local minimizer of (3).

Proof. By Proposition 2.4(b) and (c), for each $j \in A(\bar{w})$, a scalar $\gamma_j \geq 0$ exists such that $\bar{w}$ is a d-stationary solution, thus minimizer, of (5), which is a convex program because $h_j$ is affine. To complete the proof, let $w \in \widehat{W}_\beta$ be sufficiently close to $\bar{w}$ so that $A(w) \subseteq A(\bar{w})$. Let $j \in A(w)$. We then have

$\ell(\bar{w}) = \ell(\bar{w}) + \gamma_j P(\bar{w}) - \gamma_j \beta$
$= \ell(\bar{w}) + \gamma_j \left( g(\bar{w}) - h_j(\bar{w}) \right) - \gamma_j \beta$ because $j \in A(\bar{w})$
$\leq \ell(w) + \gamma_j \left( g(w) - h_j(w) \right) - \gamma_j \beta$ by the minimizing property of $\bar{w}$
$= \ell(w) + \gamma_j \left( P(w) - \beta \right) \leq \ell(w)$ because $j \in A(w)$ and $P(w) \leq \beta$.

This establishes that $\bar{w}$ is a local minimizer of (3).

3. Minimizing and sparsity properties. In general, a stationary solution of (1) is not guaranteed to possess any minimizing property.
For smooth problems involving twice continuously differentiable functions, it follows from classical nonlinear programming theory that, with an appropriate second-order sufficiency condition, a stationary solution can be shown to be strictly locally minimizing. Such a well-known result becomes highly complicated when the functions are nondifferentiable. Although one could, in principle, apply some generalized second derivatives, e.g., those based on Mordukhovich's nonsmooth calculus [38], it would be preferable in our context to derive a simplified theory that is more readily applicable to particular sparsity functions, such as those to be discussed in sections 5 and 6.

3.1. Preliminary discussion of the main result. In a nutshell, our goal is to extend the characterization of a d-stationary solution of (1) in Proposition 2.1 by showing that, under a set of assumptions, including a specific choice of the convex function $g$, and for a range of values of $\gamma$ to be identified in the analysis, any such nonzero d-stationary solution $w^*$ either has $\|w^*\|_0$ bounded above by a scalar computable from certain model constants, or

(6) $w^* \in \underset{w \in W}{\arg\min}\; \underbrace{\ell(w) + \gamma \left( g(w) - h_i(w) \right)}_{\text{remains dc}} \quad \forall\, i \in A(w^*);$

see Theorem 3.2 in subsection 3.4 for details. In terms of the original function $Z_\gamma$, the above condition implies a restricted global optimality property of $w^*$; namely,

(7) $w^* \in \underset{w \in W^*}{\arg\min}\; Z_\gamma(w)$, where $W^* \triangleq \{ w \in W \mid A(w) \cap A(w^*) \neq \emptyset \}$.

To see this implication, it suffices to note that if $w \in W^*$, then letting $i$ be a common index in $A(w) \cap A(w^*)$, we have

$Z_\gamma(w) - Z_\gamma(w^*) = \left[ \ell(w) + \gamma \left( g(w) - h_i(w) \right) \right] - \left[ \ell(w^*) + \gamma \left( g(w^*) - h_i(w^*) \right) \right] \geq 0.$

Borrowing a terminology from piecewise programming, we say that the vectors $w$ and $w^*$ share a common piece if $A(w) \cap A(w^*) \neq \emptyset$. Thus the restricted global optimality property (7) of $w^*$ says that $w^*$ is a true global minimizer of $Z_\gamma(w)$ among those vectors $w \in W$ that share a common piece with $w^*$. The subset $W^*$ includes a neighborhood of $w^*$ in $W$; i.e., for a suitable scalar $\delta > 0$, $B(w^*; \delta) \cap W \subseteq W^*$, where $B(w^*; \delta)$ is a Euclidean ball centered at $w^*$ with radius $\delta > 0$. Thus, the optimality property (6) implies in particular that $w^*$ is a local minimizer of $Z_\gamma$ on $W$. Another consequence of (6) is that if $I = 1$ (so that $h$ is convex and differentiable), then $w^*$ is a global minimizer of $Z_\gamma$ on $W$. Besides the above special cases, our main result applies to many well-known sparsity functions in the statistical learning literature, plus the relatively new and largely unexplored class of exact K-sparsity functions to be discussed later.

3.2. The assumptions. The analysis requires two main sets of assumptions: one on the model functions $\ell(w)$ and $\{h_i(w)\}_{i=1}^I$ and the other on the constraint set $W$. We first state the former set of assumptions, which introduce the key model constants $\mathrm{Lip}_{\nabla\ell}$, $\lambda_\ell$, $\mathrm{Lip}_h$, and $\zeta$. In principle, we could localize these assumptions around a given d-stationary solution; instead, our intention in stating these assumptions globally on the set $W$ is to use the model constants to identify the parameter $\gamma$ that defines the optimization problem (1).
(A) In addition to (A_0):

(A_Lℓ) the loss function ℓ(w) is of class LC¹ (for continuously differentiable with a Lipschitz gradient) on W; i.e., a positive scalar Lip_∇ℓ exists such that for all w and w′ in W,

(8)   ‖ ∇ℓ(w) − ∇ℓ(w′) ‖ ≤ Lip_∇ℓ ‖ w − w′ ‖;

(A_cvx) a nonnegative constant λ exists such that

(9)   ℓ(w) − ℓ(w′) − ∇ℓ(w′)ᵀ( w − w′ ) ≥ (λ/2) ‖ w − w′ ‖²   for all w and w′ in W;

(A_Lh) nonnegative constants Lip_∇h and β exist such that for each i ∈ {1, ..., I} and all w and w′ in W, with 0/0 defined to be zero,

(10)   0 ≤ h_i(w) − h_i(w′) − ∇h_i(w′)ᵀ( w − w′ ) ≤ ( Lip_∇h/2 + β/‖w′‖ ) ‖ w − w′ ‖²;

(A_Bh)   max_{1≤i≤I} sup_{w ∈ W} ‖ ∇h_i(w) ‖ ≜ ζ < ∞.

By the mean-value theorem for multivariate functions [41, 3.2.1], we derive from the inequality (8),

ℓ(w) − ℓ(w′) − ∇ℓ(w′)ᵀ( w − w′ ) ≤ ( Lip_∇ℓ/2 ) ‖ w − w′ ‖²   for all w and w′ in W.

Thus Lip_∇ℓ ≥ λ. When ℓ(w) = ½ wᵀQw + qᵀw is a strongly convex quadratic function with a symmetric positive definite matrix Q, the ratio

λ / Lip_∇ℓ = λ_min(Q) / λ_max(Q),

where the numerator and denominator of the right-side ratio are the smallest and largest eigenvalues of Q, respectively, is exactly the reciprocal of the condition number of the matrix Q.

We discuss a bit about the two assumptions (A_cvx) and (A_Lh). The constant λ expresses the convexity of the loss function ℓ, and the strong convexity when λ > 0. For certain nonpolyhedral sparsity functions, this constant needs to be positive in order for interesting results to be obtained. One may argue that the latter positivity condition may be too restrictive to be satisfied in applications. For instance, in sparse linear regression where ℓ(w) = ½ ‖Aw − b‖², the matrix A can be expected to be column rank deficient, so that ℓ is only convex but not strongly so. For this problem, the results below are applicable to the regularized loss function ℓ_α(w) ≜ ℓ(w) + α‖w‖² for any positive scalar α. More generally, any ℓ2-regularized convex function is strongly convex and thus satisfies condition (A_cvx) with a positive λ. Another way to soften the positivity of λ satisfying (9) is to require that this inequality (with a positive λ) hold only on a subset of W, e.g., on the set L* ∩ W, where

L* ≜ { w ∈ R^m : w_i = 0 whenever w*_i = 0 }.

In this case, the minimizing property of w* will be restricted to the set L* ∩ W. It turns out that if the functions h_i are affine, so that the function h(w) ≜ max_{1≤i≤I} h_i(w) is piecewise affine (and convex), the main result could hold for λ = 0. Yet another localization of the global inequality (9) is to assume that the Hessian ∇²ℓ(w*) exists and is positive definite (either on the entire space R^m or on the subspace L*). Under such a pointwise strong convexity condition (via the positive definiteness of the Hessian matrix), one obtains a locally strictly minimizing property of w*.
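The regularization remedy just described is easy to see numerically. The snippet below is only an illustrative sketch, not taken from the paper; the matrix A and the constant α are arbitrary. For a rank-deficient least-squares loss, the modulus λ in (9) is the smallest eigenvalue of the Hessian, and adding α‖w‖² lifts every eigenvalue by 2α.

```python
import numpy as np

# Rank-deficient least-squares loss l(w) = 0.5*||A w - b||^2 has Hessian Q = A^T A;
# the constants in (8)-(9) are Lip = lambda_max(Q) and lambda = lambda_min(Q).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
A[:, 2] = A[:, 0] + A[:, 1]                 # force column rank deficiency
Q = A.T @ A
eigs = np.linalg.eigvalsh(Q)                # eigenvalues in ascending order
lam, Lip = eigs[0], eigs[-1]
print(lam < 1e-8)                           # True: only convex, not strongly convex

# l_alpha(w) = l(w) + alpha*||w||^2 has Hessian Q + 2*alpha*I, so its smallest
# eigenvalue, i.e., the new modulus lambda, is lifted by 2*alpha.
alpha = 0.1
lam_reg = np.linalg.eigvalsh(Q + 2 * alpha * np.eye(3))[0]
print(lam_reg >= 2 * alpha - 1e-9)          # True: (A_cvx) now holds with lambda = 2*alpha
```

This mirrors the observation above: ℓ itself fails strong convexity, while ℓ_α satisfies (A_cvx) with λ = 2α.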
The upshot of this discussion is that, in general, if one desires a strong minimizing property to hold on the entire set W, then the inequality (9) with a positive λ is essential to compensate for the nonconvexity of the sparsity function P when h_i is nonaffine; relaxation of the condition (9) with a positive λ is possible either with a piecewise linear function h or at the expense of weakening the minimizing conclusion when the functions h_i are not affine.

As we will see in the subsequent sections, assumption (A_Lh) accommodates a host of convex functions h_i. Foremost is the case when the functions h_i are of class LC¹; in this case, we may take β = 0 and Lip_∇h to be the largest of the Lipschitz moduli of the gradients {∇h_i(w)}_{i=1}^I on W. In particular, when each function h_i is affine, we may further take Lip_∇h to be zero or any positive scalar. Assumption (A_Lh) also includes the case where I = 1 and h(w) = ‖w‖_2. Indeed, this follows from the following inequality: provided w′ ≠ 0,

‖w‖_2 − ‖w′‖_2 − ( w′/‖w′‖_2 )ᵀ ( w − w′ ) ≤ ( 1/‖w′‖_2 ) ‖ w − w′ ‖²,

which is equivalent to

‖w′‖_2 ‖w‖_2 − ‖w′‖²_2 − (w′)ᵀ( w − w′ ) ≤ ‖ w − w′ ‖².

The left-hand side is equal to ‖w′‖‖w‖ − (w′)ᵀw. We have

‖ w − w′ ‖² − [ ‖w′‖‖w‖ − (w′)ᵀw ] = ‖w‖² − (w′)ᵀw + ‖w′‖² − ‖w′‖‖w‖
                                   ≥ ‖w‖² − ‖w′‖‖w‖ + ‖w′‖² − ‖w′‖‖w‖ = [ ‖w‖ − ‖w′‖ ]² ≥ 0,

which shows that we may take Lip_∇h = 0 and β = 1 for the ℓ2-norm function. It is important to remark that, in general, if (10) holds with a zero Lip_∇h, it certainly holds with a positive Lip_∇h. Nevertheless, we refrain from taking this constant as always positive because it affects the selection of the constant γ, as will become clear in the main theorem, Theorem 3.2. Finally, (A_Lh) also accommodates any linear combination of functions each satisfying this assumption. Such a combination may be relevant in problems of group sparsity minimization where one may consider the use of different sparsity functions for different groups.

The second assumption (B) below is on the constraint set W. Consistent with the previous assumptions, we state assumption (B) with respect to an arbitrary vector w ∈ W, making it a global-like condition; however, it will be clear from the analysis that the assumption is most specific to a stationary solution being analyzed. The assumption is satisfied, for example, if W = Π_{i=1}^m W_i, where each W_i is one of three sets: R (for an unrestricted variable) or R_± (for problems with sign restrictions on the variables w_i). Thus our results are applicable in particular to sign-constrained problems, a departure from the literature, which typically either deals with unconstrained problems or makes an interiority assumption on w* that essentially converts the analysis to the unconstrained case.

(B) For a given vector w ∈ W, there exists ε̄ > 0 such that for all ε ∈ (0, ε̄] and for every i with w_i ≠ 0, the vectors w^{±ε;i} defined below belong to W:

w_j^{±ε;i} ≜ { w_j ± ε if j = i;  w_j if j ≠ i },   j = 1, ..., m.

While our analysis is applicable to a general convex function g, a more interesting case is when g is a weighted ℓ1-function given by

(11)   g(w) ≜ Σ_{i=1}^m ξ_i |w_i|,   w ∈ R^m,

for some positive constants ξ_i.
This choice of the function g was employed in the study [31] of dc approximations to sparsity minimization; Proposition 6 in this reference shows that a broad class of separable sparsity functions in the statistical learning literature has a dc decomposition with the above function g as the convex part. For this choice of g, if η ∈ ∂g(w), with ∂g(w) denoting the subdifferential of g at w, then

‖η_0‖ = √( Σ_{i : w_i ≠ 0} ξ_i² ) ≥ ( min_{1≤i≤m} ξ_i ) √‖w‖_0,

where η_0 denotes the subvector of η with components corresponding to the nonzero components of w. Let ξ_min ≜ min_{1≤i≤m} ξ_i, so that ‖η_0‖ ≥ ξ_min √‖w‖_0 for all w ∈ R^m.
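The subgradient bound above is easy to verify numerically. The following is a minimal sketch with arbitrary weights ξ_i and an arbitrary vector w (neither comes from the paper):

```python
import numpy as np

xi = np.array([0.5, 1.0, 2.0, 3.0])       # positive weights in g(w) = sum_i xi_i*|w_i|
w = np.array([1.3, 0.0, -0.7, 2.1])

support = w != 0
eta0 = xi[support] * np.sign(w[support])   # subgradient entries of g on the support of w
xi_min = xi.min()
w0 = int(support.sum())                    # ||w||_0, the number of nonzeros

lhs = np.linalg.norm(eta0)                 # ||eta_0|| = sqrt of sum of xi_i^2 over the support
rhs = xi_min * np.sqrt(w0)
print(lhs >= rhs)  # True
```

On the support of w the subgradient entries are exactly ξ_i sign(w_i), so the inequality reduces to bounding each ξ_i below by ξ_min.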

3.3. The case β = 0. We begin our analysis with a result pertaining to the case where condition (A_Lh) holds with β = 0. This is the case where each h_i is of class LC¹. The special property (B) on W is not needed in this case; also, no assumption is imposed on g except for its convexity.

Proposition 3.1. Suppose that (A_cvx) and (A_Lh) with β = 0 hold. For any scalar γ > 0 such that δ ≜ λ − γ Lip_∇h ≥ 0, the following two statements hold:

(a) If w* is a d-stationary point of Z_γ on W, then for all i ∈ A(w*),

[ ℓ(w) + γ ( g(w) − h_i(w) ) ] − [ ℓ(w*) + γ ( g(w*) − h_i(w*) ) ] ≥ (δ/2) ‖ w − w* ‖²   for all w ∈ W.

Hence w* is a minimizer of Z_γ(w) on W̄. Moreover, if δ > 0, then w* is the unique minimizer of ℓ(w) + γ ( g(w) − h_i(w) ) on W for all i ∈ A(w*), hence the unique minimizer of Z_γ(w) on W̄.

(b) If I = 1, then Z_γ is convex on W with modulus δ (thus is strongly convex if δ > 0).

Proof. To prove (a), let w ∈ W be arbitrary. For any i ∈ A(w*), we have

[ ℓ(w) + γ ( g(w) − h_i(w) ) ] − [ ℓ(w*) + γ ( g(w*) − h_i(w*) ) ]
  ≥ [ ℓ(w) + γ ( g(w) − h_i(w) ) ] − [ ℓ(w*) + γ ( g(w*) − h_i(w*) ) ] − [ ∇ℓ(w*)ᵀ( w − w* ) + γ ( g′(w*; w − w*) − ∇h_i(w*)ᵀ( w − w* ) ) ]
  = [ ℓ(w) − ℓ(w*) − ∇ℓ(w*)ᵀ( w − w* ) ] + γ [ g(w) − g(w*) − g′(w*; w − w*) ] − γ [ h_i(w) − h_i(w*) − ∇h_i(w*)ᵀ( w − w* ) ]
  ≥ ½ ( λ − γ Lip_∇h ) ‖ w − w* ‖².

Thus (a) follows. To prove (b), suppose I = 1. We have, for any w and w′ in W,

Z_γ(w) − Z_γ(w′) − Z′_γ(w′; w − w′) ≥ ½ ( λ − γ Lip_∇h ) ‖ w − w′ ‖².

Statement (b) follows readily from the above inequality.

Statement (b) of the above result is, in general, not true if I > 1, as illustrated by the univariate function Z_γ(t) = t² − γ|t| = t² − γ max( t, −t ). (This univariate function fits our framework with g(t) = |t| and h(t) = 2|t| = max( 2t, −2t ).) It can be seen that for any γ > 0 this function is not convex, because

Z_γ(0) = 0 > −γ²/4 = ½ [ Z_γ(γ/2) + Z_γ(−γ/2) ].

Nevertheless, this function has two stationary points at ±γ/2 that are both global minima of Z_γ. This simple result illustrates a couple of noteworthy points. First, the convexity of the function Z_γ cannot be expected for I > 1 even though the loss function may be strongly convex.
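The two claims about this univariate example, the failure of the midpoint convexity inequality and the fact that both stationary points attain the global minimum, can be confirmed numerically. A small sketch (the value of γ and the search grid are arbitrary choices, not from the paper):

```python
import numpy as np

gamma = 2.0
Z = lambda t: t**2 - gamma * abs(t)       # Z_gamma(t) = t^2 - gamma*|t|

# Failure of the midpoint inequality at 0 shows Z_gamma is not convex:
midpoint_gap = Z(0.0) - 0.5 * (Z(gamma / 2) + Z(-gamma / 2))
print(midpoint_gap)                       # 1.0, i.e., gamma^2/4 > 0

# Both stationary points +-gamma/2 attain the global minimum -gamma^2/4:
grid = np.linspace(-3.0, 3.0, 600001)
vals = grid**2 - gamma * np.abs(grid)
print(abs(vals.min() - (-gamma**2 / 4)) < 1e-6)  # True
```

The grid search is of course only a sanity check; the exact minimizers ±γ/2 follow from one-sided differentiation of Z_γ.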
Yet it is possible for a d-stationary solution of Z_γ to be a global minimizer. The latter possibility is encouraging and serves as the motivation for the subsequent extended analysis with β > 0, where an upper bound on γ will persist.

3.4. The case of g being a weighted ℓ1 function. In this subsection, we refine the proof of Proposition 3.1 by taking g to be the weighted ℓ1 function (11) and

including the constant β. We have, for all w ∈ W and all i ∈ A(w*),

[ ℓ(w) + γ ( g(w) − h_i(w) ) ] − [ ℓ(w*) + γ ( g(w*) − h_i(w*) ) ]
  ≥ [ ℓ(w) − ℓ(w*) − ∇ℓ(w*)ᵀ( w − w* ) ] + γ [ g(w) − g(w*) − g′(w*; w − w*) ] − γ [ h_i(w) − h_i(w*) − ∇h_i(w*)ᵀ( w − w* ) ]
  ≥ ½ [ λ − γ ( Lip_∇h + 2β/‖w*‖ ) ] ‖ w − w* ‖².

The remaining analysis uses the d-stationarity of w* to establish the nonnegativity of the multiplicative factor

M_γ ≜ λ − γ ( Lip_∇h + 2β/‖w*‖ ).

By the characterization of d-stationarity in Proposition 2.1, there exists a subgradient η^i ∈ ∂g(w*) such that

[ ∇ℓ(w*) + γ ( η^i − ∇h_i(w*) ) ]ᵀ ( w − w* ) ≥ 0   for all w ∈ W.

By assumption (B) on the constraint set W, we may substitute w = w*^{±ε;k} for all k such that w*_k ≠ 0 and for arbitrary ε ∈ (0, ε̄] and obtain

(12)   [ ∇ℓ(w*) + γ ( η^i − ∇h_i(w*) ) ]_k = 0   for all k such that w*_k ≠ 0.

Thus, letting ∇_0 h_i(w*) be the vector ( ∂h_i(w*)/∂w_k )_{k : w*_k ≠ 0}, we have

‖w*‖ Lip_∇ℓ + ‖∇ℓ(0)‖ ≥ ‖ ∇_0 ℓ(w*) ‖ = γ ‖ η^i_0 − ∇_0 h_i(w*) ‖ ≥ γ [ ‖η^i_0‖ − ‖∇_0 h_i(w*)‖ ],

where the first inequality is by the LC¹ property of ℓ and the triangle inequality, which also yields the last inequality. Suppose γ ≥ c ‖∇ℓ(0)‖ for a constant c > 0 to be determined momentarily. We deduce

‖w*‖ Lip_∇ℓ ≥ γ [ ‖η^i_0‖ − ‖∇_0 h_i(w*)‖ − c^{-1} ].

With g being the weighted ℓ1-function (11) and by recalling the scalar ζ in condition (A_Bh), we deduce

(13)   ‖w*‖ Lip_∇ℓ ≥ γ [ ξ_min √‖w*‖_0 − ζ − c^{-1} ].

Moreover, provided that ξ_min √‖w*‖_0 > ζ + c^{-1}, we obtain

γ ≤ ‖w*‖ Lip_∇ℓ / [ ξ_min √‖w*‖_0 − ζ − c^{-1} ].

Hence, if √‖w*‖_0 > 2 ( ζ + c^{-1} ) / ξ_min (the factor 2 can be replaced by any constant greater than 1), then γ/‖w*‖ < Lip_∇ℓ / ( ζ + c^{-1} ). Thus

M_γ > λ − γ Lip_∇h − 2β Lip_∇ℓ / ( ζ + c^{-1} ).

Consequently, if

(14)   δ_γ(c) ≜ ½ [ λ − 2β Lip_∇ℓ / ( ζ + c^{-1} ) − γ Lip_∇h ] ≥ 0,

it follows that M_γ ≥ 0. It remains to determine the constant c. To ensure the validity of the above derivations, we need to have

(15)   c ‖∇ℓ(0)‖ Lip_∇h ≤ λ − 2β Lip_∇ℓ / ( ζ + c^{-1} ),

which is equivalent to

q(c) ≜ c² ζ ‖∇ℓ(0)‖ Lip_∇h + c [ ‖∇ℓ(0)‖ Lip_∇h + 2β Lip_∇ℓ − λ ζ ] − λ ≤ 0.

Summarizing the above derivations, we have established the following main result, which does not require further proof.

Theorem 3.2. Under the assumptions in subsection 3.2 with the constants λ, Lip_∇ℓ, Lip_∇h, β, ζ, and ξ_min as given therein, let g be given by (11), and let c > 0 be any constant satisfying q(c) ≤ 0. Let γ > 0 satisfy

(16)   c ‖∇ℓ(0)‖ ≤ γ   and   γ Lip_∇h ≤ λ − 2β Lip_∇ℓ / ( ζ + c^{-1} ).

If w* is a d-stationary solution of Z_γ on W, then either

(17)   ‖w*‖_0 ≤ [ 2 ( ζ + c^{-1} ) / ξ_min ]²,

or for all i ∈ A(w*) and all w ∈ W,

[ ℓ(w) + γ ( g(w) − h_i(w) ) ] − [ ℓ(w*) + γ ( g(w*) − h_i(w*) ) ] ≥ δ_γ(c) ‖ w − w* ‖².

If the above inequality holds, then w* is a minimizer of Z_γ(w) on W̄. Moreover, if δ_γ(c) > 0, then w* is the unique minimizer of ℓ(w) + γ ( g(w) − h_i(w) ) on W for all i ∈ A(w*), hence the unique minimizer of Z_γ(w) on W̄.

Remark. As shown above, q(c) ≤ 0 if and only if (15) holds. With the latter inequality, we can choose γ > 0 so that (16) holds. In turn, the inequalities in (16) implicitly impose a condition on the four model constants λ, Lip_∇ℓ, Lip_∇h, and β, and suggest a choice of γ in terms of them for the theorem to hold.

As a (restricted) global minimizer of Z_γ, w* has the property that for every w ∈ W̄, either ℓ(w*) ≤ ℓ(w) or P(w*) < P(w). In particular, if w* is not a global minimizer of the loss function ℓ on W̄, then we must have P(w*) < P(ŵ), where ŵ is the latter minimizer. In the language of multicriteria optimization, such a vector w* is a Pareto point on W̄ of the two criteria, loss and model control; specifically, there does not exist a w ∈ W̄ such that the pair ( ℓ(w), P(w) ) is componentwise less than or equal to, and not equal to, the pair ( ℓ(w*), P(w*) ). The implication of the (restricted) global minimizing property of a d-stationary point is reminiscent of the class of pseudoconvex functions [36], which, by their definition, are differentiable functions with the property that all their stationary points are global minimizers of the functions. There are two differences, however.
One difference is that the function Z_γ is only directionally differentiable and its stationary solutions are defined in terms of the directional derivatives. Another difference is that when β > 0, a (non)sparsity stipulation (i.e., the failure of (17)) on the d-stationary solution w* is needed for it to be a (restricted) global minimizer.

The following corollary of Theorem 3.2 is worth mentioning. For simplicity, we state the corollary in terms of Z_γ on the subset W̄. No proof is required.

Corollary 3.3. Under the assumptions of Theorem 3.2, suppose that Lip_∇h = 0. Let g be given by (11). Let c > 0 satisfy c [ 2β Lip_∇ℓ − λ ζ ] ≤ λ. For any γ ≥ c ‖∇ℓ(0)‖, if w* is a d-stationary solution of Z_γ on W, then either

(18)   Z_γ(w) − Z_γ(w*) ≥ ½ [ λ − 2β Lip_∇ℓ / ( ζ + c^{-1} ) ] ‖ w − w* ‖²   for all w ∈ W̄,

or ‖w*‖_0 ≤ [ 2 ( ζ + c^{-1} ) / ξ_min ]².
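To make the interplay of the model constants concrete, the helper below evaluates q(c), the interval of admissible γ values in (16), and the sparsity threshold in (17). It is only a sketch; the function name and all numerical constants fed to it are hypothetical, chosen merely to exhibit one feasible configuration.

```python
def theorem_32_quantities(c, lam, Lip_grad_l, Lip_grad_h, beta, zeta, xi_min, grad_l0_norm):
    """Evaluate q(c), the gamma-interval in (16), and the sparsity threshold in (17).

    The arguments are the model constants of subsection 3.2; grad_l0_norm
    stands for the norm of the loss gradient at the origin.
    """
    q = (c**2 * zeta * grad_l0_norm * Lip_grad_h
         + c * (grad_l0_norm * Lip_grad_h + 2 * beta * Lip_grad_l - lam * zeta)
         - lam)
    gamma_lo = c * grad_l0_norm
    rhs = lam - 2 * beta * Lip_grad_l / (zeta + 1 / c)
    gamma_hi = rhs / Lip_grad_h if Lip_grad_h > 0 else float("inf")
    sparsity_threshold = (2 * (zeta + 1 / c) / xi_min) ** 2
    return q, (gamma_lo, gamma_hi), sparsity_threshold

# Hypothetical model constants (not taken from the paper):
q, (g_lo, g_hi), thr = theorem_32_quantities(
    c=0.5, lam=2.0, Lip_grad_l=4.0, Lip_grad_h=1.0,
    beta=0.0, zeta=1.0, xi_min=1.0, grad_l0_norm=1.0)
print(q <= 0 and g_lo <= g_hi)  # True: any gamma in [0.5, 2.0] satisfies (16)
print(thr)                      # 36.0: the sparsity threshold in (17)
```

With these numbers q(c) ≤ 0 holds, so Theorem 3.2 applies for any γ in the printed interval, and the dichotomy is between ‖w*‖_0 ≤ 36 and the restricted global minimizing property.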

We close this section by giving a bound on ‖w*‖_0 based on the inequality (13). The validity of this bound does not require the positivity of λ; nevertheless, the sparsity bound presumes a bound on ‖w*‖; this is in contrast to Theorem 3.2, which makes no assumption on w* except for its d-stationarity.

Proposition 3.4. Suppose that assumptions (A_cvx), (A_Lℓ), and (B) hold. Let g be given by (11). For every γ > c ‖∇ℓ(0)‖ for some scalar c > 0, if w* is a d-stationary solution of Z_γ on W such that

‖w*‖ ≤ ( c ‖∇ℓ(0)‖ / Lip_∇ℓ ) [ ξ_min √(L + 1) − ζ − c^{-1} ]

for some integer L > 0, then ‖w*‖_0 ≤ L.

Proof. Assume for contradiction that ‖w*‖_0 ≥ L + 1. Then w* ≠ 0; hence ξ_min √(L + 1) > ζ + c^{-1}. We have

c ‖∇ℓ(0)‖ [ ξ_min √(L + 1) − ζ − c^{-1} ]
  < γ [ ξ_min √(L + 1) − ζ − c^{-1} ]   (by the choice of γ)
  ≤ γ [ ξ_min √‖w*‖_0 − ζ − c^{-1} ]   (by the assumption on ‖w*‖_0)
  ≤ ‖w*‖ Lip_∇ℓ   (by (13))
  ≤ c ‖∇ℓ(0)‖ [ ξ_min √(L + 1) − ζ − c^{-1} ]   (by the assumption on ‖w*‖).

This contradiction establishes the desired bound on ‖w*‖_0.

4. Sparsity functions and their dc representations. In this and the next section, we investigate the application of the results in section 3 to a host of nonconvex sparsity functions that have been studied extensively in the literature. These nonconvex functions are deviations from the convex ℓ1-norm, which is a well-known convex surrogate for the ℓ0-function. As mentioned in the introduction, we classify these sparsity functions into two categories: the exact ones and the surrogate ones. The next section discusses the class of exact sparsity functions; section 6 discusses the surrogate sparsity functions. In each case, we identify the constants Lip_∇h, β, ζ, and ξ_min of the sparsity functions and present the specialization of the results in the last section to these functions. For convenience, we summarize in Table 1 the sparsity functions being analyzed. The results obtained in the next two sections are summarized in Tables 2 and 3 appended.

5. Exact sparsity functions.
While the class of exact K-sparsity functions has been discussed exclusively in the context of matrix optimization, for the sake of completeness we formally define these functions for vectors and highlight their properties most relevant to our discussion. For an m-vector w = (w_i)_{i=1}^m, let w_[·] ≜ (w_[i])_{i=1}^m be derived from w by ranking the absolute values of its components in nonincreasing order:

max_{1≤i≤m} |w_i| ≜ |w_[1]| ≥ |w_[2]| ≥ ... ≥ |w_[m]| ≜ min_{1≤i≤m} |w_i|;

thus |w_[k]| is the kth largest of the m components of w in absolute value, and { [1], ..., [m] } is an arrangement of the index set { 1, ..., m }. For a fixed positive integer K, let

P_[K](w) ≜ ‖w‖_1 − Σ_{k=1}^K |w_[k]| = Σ_{k=K+1}^m |w_[k]|,   with g_[K](w) ≜ ‖w‖_1 and h_[K](w) ≜ Σ_{k=1}^K |w_[k]|.

Clearly, P_[K](w) = 0 if and only if w has no more than K nonzero components. The convexity of h_[K], thus the dc-property of P_[K], follows from the value-function

representation below:

(19)   h_[K](w) ≜ Σ_{k=1}^K |w_[k]| = maximum_{v ∈ Δ_{K,m}} Σ_{i=1}^m v_i |w_i|,   where Δ_{K,m} ≜ { v ∈ [0, 1]^m : Σ_{i=1}^m v_i = K }.

Since h_[K] is piecewise linear, assumption (A_Lh) holds with Lip_∇h = 0; moreover, we have β = 0. In order to calculate the constant ζ associated with the function h_[K], it would be useful to express h_[K](w) as the pointwise maximum of finitely many linear functions. For this purpose, let E(Δ_{K,m}) be the finite set of extreme points of the polytope Δ_{K,m}.

Table 1
Exact and surrogate penalty functions and their properties.

Penalty         | Difference-of-convex representation g(w) − h(w)                                                | Constants
K-sparsity      | ‖w‖_1 − Σ_{k=1}^K |w_[k]|                                                                      | β = Lip_∇h = 0
ℓ1 − ℓ2         | ‖w‖_1 − ‖w‖_2                                                                                  | β = 1; Lip_∇h = 0
SCAD            | Σ_{i=1}^m λ_i |w_i| − Σ_{i=1}^m h_i(w_i), with h_i(w_i) = 0 if |w_i| ≤ λ_i; ( |w_i| − λ_i )²/( 2(a − 1) ) if λ_i ≤ |w_i| ≤ a λ_i; λ_i |w_i| − (a + 1) λ_i²/2 if |w_i| ≥ a λ_i | β = 0; Lip_∇h = 1/(a − 1)
Capped ℓ1       | Σ_{i=1}^m |w_i|/a_i − Σ_{i=1}^m max{ 0, |w_i|/a_i − 1 }                                        | β = Lip_∇h = 0
Transformed ℓ1  | Σ_{i=1}^m (a_i + 1)|w_i|/a_i − Σ_{i=1}^m [ (a_i + 1)|w_i|/a_i − (a_i + 1)|w_i|/(a_i + |w_i|) ] | β = 0; Lip_∇h = max_{1≤i≤m} 2(a_i + 1)/a_i²
Logarithmic     | Σ_{i=1}^m (λ_i/ε_i)|w_i| − Σ_{i=1}^m λ_i [ |w_i|/ε_i − log( |w_i| + ε_i ) + log ε_i ]          | β = 0; Lip_∇h = max_{1≤i≤m} λ_i/ε_i²

Table 2
Summary of general results; w* is a d-stationary solution.

Result    | Condition on constants                                                   | Conclusions
Thm. 3.2  | q(c) ≤ 0 and c ‖∇ℓ(0)‖ Lip_∇h ≤ γ Lip_∇h ≤ λ − 2β Lip_∇ℓ/(ζ + c^{-1})   | Either ‖w*‖_0 ≤ [ 2(ζ + c^{-1})/ξ_min ]², or w* is a minimizer on W̄, unique if γ Lip_∇h < λ − 2β Lip_∇ℓ/(ζ + c^{-1})
Prop. 3.4 | c ‖∇ℓ(0)‖ < γ                                                            | ‖w*‖ ≤ ( c ‖∇ℓ(0)‖/Lip_∇ℓ ) [ ξ_min √(L + 1) − ζ − c^{-1} ] implies ‖w*‖_0 ≤ L
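Both P_[K] and the value-function representation (19) can be sanity-checked numerically: P_[K] needs nothing more than a sort, and since the maximum in (19) is attained at an extreme point of Δ_{K,m}, i.e., a 0/1 vector with exactly K ones, a brute-force search over K-subsets must reproduce the sum of the K largest absolute entries. A small illustrative sketch (the test vector is arbitrary, not from the paper):

```python
import itertools
import numpy as np

def h_K(w, K):
    # h_[K](w): sum of the K largest |w_i|, computed by sorting
    return np.sort(np.abs(w))[::-1][:K].sum()

def P_K(w, K):
    # Exact K-sparsity function P_[K](w) = ||w||_1 - h_[K](w)
    return np.abs(w).sum() - h_K(w, K)

def h_K_extreme_points(w, K):
    # Value-function form (19), maximized by brute force over the extreme
    # points of Delta_{K,m}: the 0/1 vectors with exactly K ones.
    m = len(w)
    return max(sum(abs(w[i]) for i in S)
               for S in itertools.combinations(range(m), K))

w = np.array([0.25, -1.5, 3.0, -0.125, 0.0])   # dyadic values, exact float sums
print(h_K(w, 2) == h_K_extreme_points(w, 2))    # True: both equal 4.5
print(P_K(w, 4))                                # 0.0: w has only 4 nonzero components
```

The last line illustrates the exactness property: P_[K](w) vanishes precisely when w has no more than K nonzeros.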

Table 3
Summary of specialized results; w* is a d-stationary solution and q(c) ≜ c² ζ ‖∇ℓ(0)‖ Lip_∇h + c [ ‖∇ℓ(0)‖ Lip_∇h + 2β Lip_∇ℓ − λ ζ ] − λ.

Result                     | Condition on constants                                   | Conclusions
Thm. 5.1 (K-sparsity)      | c ‖∇ℓ(0)‖ < γ                                            | w* is a minimizer on W̄
Thm. 5.1 (K-sparsity)      | c ‖∇ℓ(0)‖ < γ and √(K + 1) > √K + c^{-1}                 | ‖w*‖ ≤ ( c ‖∇ℓ(0)‖/Lip_∇ℓ ) [ √(K + 1) − √K − c^{-1} ] implies ‖w*‖_0 ≤ K
Thm. 5.2 (ℓ1 − ℓ2)         | c ( 2 Lip_∇ℓ − λ ) ≤ λ and c ‖∇ℓ(0)‖ < γ                 | Either ‖w*‖_0 ≤ [ 2(1 + c^{-1}) ]², or w* is a global minimizer on W̄
Thm. 6.1 (SCAD)            | c ‖∇ℓ(0)‖ < γ ≤ (a − 1) λ                                | w* is a global minimizer on W̄
Thm. 6.1 (SCAD)            | c ‖∇ℓ(0)‖ < γ                                            | ‖w*‖ ≤ ( c ‖∇ℓ(0)‖/Lip_∇ℓ ) [ ( min_{1≤j≤m} λ_j ) √(L + 1) − ζ − c^{-1} ] implies ‖w*‖_0 ≤ L
Thm. 6.2 (Capped ℓ1)       | c ‖∇ℓ(0)‖ < γ                                            | w* is a global minimizer on W̄
Thm. 6.2 (Capped ℓ1)       | c ‖∇ℓ(0)‖ < γ                                            | For each i = 1, ..., m, either w*_i = 0 or |w*_i| ≥ a_i
Thm. 6.3 (Transformed ℓ1)  | c ‖∇ℓ(0)‖ < γ ≤ λ / ( 2 max_{1≤i≤m} (a_i + 1)/a_i² )     | w* is a global minimizer on W̄
Thm. 6.3 (Transformed ℓ1)  | c ‖∇ℓ(0)‖ < γ                                            | ‖w*‖ ≤ ( c ‖∇ℓ(0)‖/Lip_∇ℓ ) [ ( min_{1≤j≤m} (a_j + 1)/a_j ) √(L + 1) − ζ − c^{-1} ] implies ‖w*‖_0 ≤ L
Thm. 6.4 (Logarithmic)     | c ‖∇ℓ(0)‖ < γ ≤ λ / ( max_{1≤i≤m} λ_i/ε_i² )             | w* is a global minimizer on W̄
Thm. 6.4 (Logarithmic)     | c ‖∇ℓ(0)‖ < γ                                            | ‖w*‖ ≤ ( c ‖∇ℓ(0)‖/Lip_∇ℓ ) [ ( min_{1≤i≤m} λ_i/ε_i ) √(L + 1) − ζ − c^{-1} ] implies ‖w*‖_0 ≤ L

We then have

h_[K](w) = max { Σ_{i=1}^m v_i σ_i w_i : v ∈ E(Δ_{K,m}), σ ∈ { ±1 }^m }.

Note that if v ∈ E(Δ_{K,m}), then each component of v is either 0 or 1 and there are exactly K ones. For a given pair (v, σ) ∈ E(Δ_{K,m}) × { ±1 }^m, the linear function h_{v,σ}: w ↦ Σ_{i=1}^m v_i σ_i w_i has gradient given by v ∘ σ, where ∘ is the notation for the Hadamard product of two vectors. Hence,

‖ ∇_0 h_{v,σ}(w) ‖ = √( Σ_{i : w_i ≠ 0} v_i σ_i² ) ≤ √( min( K, ‖w‖_0 ) ).

Since for any nonzero vector w and for any η ∈ ∂g_[K](w), ‖η_0‖ = √‖w‖_0, it follows


More information

An explicit Jordan Decomposition of Companion matrices

An explicit Jordan Decomposition of Companion matrices An expicit Jordan Decomposition of Companion matrices Fermín S V Bazán Departamento de Matemática CFM UFSC 88040-900 Forianópois SC E-mai: fermin@mtmufscbr S Gratton CERFACS 42 Av Gaspard Coriois 31057

More information

Uniformly Reweighted Belief Propagation: A Factor Graph Approach

Uniformly Reweighted Belief Propagation: A Factor Graph Approach Uniformy Reweighted Beief Propagation: A Factor Graph Approach Henk Wymeersch Chamers University of Technoogy Gothenburg, Sweden henkw@chamers.se Federico Penna Poitecnico di Torino Torino, Itay federico.penna@poito.it

More information

arxiv: v2 [cs.lg] 4 Sep 2014

arxiv: v2 [cs.lg] 4 Sep 2014 Cassification with Sparse Overapping Groups Nikhi S. Rao Robert D. Nowak Department of Eectrica and Computer Engineering University of Wisconsin-Madison nrao2@wisc.edu nowak@ece.wisc.edu ariv:1402.4512v2

More information

The Group Structure on a Smooth Tropical Cubic

The Group Structure on a Smooth Tropical Cubic The Group Structure on a Smooth Tropica Cubic Ethan Lake Apri 20, 2015 Abstract Just as in in cassica agebraic geometry, it is possibe to define a group aw on a smooth tropica cubic curve. In this note,

More information

Nonlinear Analysis of Spatial Trusses

Nonlinear Analysis of Spatial Trusses Noninear Anaysis of Spatia Trusses João Barrigó October 14 Abstract The present work addresses the noninear behavior of space trusses A formuation for geometrica noninear anaysis is presented, which incudes

More information

Week 6 Lectures, Math 6451, Tanveer

Week 6 Lectures, Math 6451, Tanveer Fourier Series Week 6 Lectures, Math 645, Tanveer In the context of separation of variabe to find soutions of PDEs, we encountered or and in other cases f(x = f(x = a 0 + f(x = a 0 + b n sin nπx { a n

More information

Homogeneity properties of subadditive functions

Homogeneity properties of subadditive functions Annaes Mathematicae et Informaticae 32 2005 pp. 89 20. Homogeneity properties of subadditive functions Pá Burai and Árpád Száz Institute of Mathematics, University of Debrecen e-mai: buraip@math.kte.hu

More information

Coupling of LWR and phase transition models at boundary

Coupling of LWR and phase transition models at boundary Couping of LW and phase transition modes at boundary Mauro Garaveo Dipartimento di Matematica e Appicazioni, Università di Miano Bicocca, via. Cozzi 53, 20125 Miano Itay. Benedetto Piccoi Department of

More information

Source and Relay Matrices Optimization for Multiuser Multi-Hop MIMO Relay Systems

Source and Relay Matrices Optimization for Multiuser Multi-Hop MIMO Relay Systems Source and Reay Matrices Optimization for Mutiuser Muti-Hop MIMO Reay Systems Yue Rong Department of Eectrica and Computer Engineering, Curtin University, Bentey, WA 6102, Austraia Abstract In this paper,

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

4 Separation of Variables

4 Separation of Variables 4 Separation of Variabes In this chapter we describe a cassica technique for constructing forma soutions to inear boundary vaue probems. The soution of three cassica (paraboic, hyperboic and eiptic) PDE

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

A UNIVERSAL METRIC FOR THE CANONICAL BUNDLE OF A HOLOMORPHIC FAMILY OF PROJECTIVE ALGEBRAIC MANIFOLDS

A UNIVERSAL METRIC FOR THE CANONICAL BUNDLE OF A HOLOMORPHIC FAMILY OF PROJECTIVE ALGEBRAIC MANIFOLDS A UNIERSAL METRIC FOR THE CANONICAL BUNDLE OF A HOLOMORPHIC FAMILY OF PROJECTIE ALGEBRAIC MANIFOLDS DROR AROLIN Dedicated to M Saah Baouendi on the occasion of his 60th birthday 1 Introduction In his ceebrated

More information

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1 Inductive Bias: How to generaize on nove data CS 478 - Inductive Bias 1 Overfitting Noise vs. Exceptions CS 478 - Inductive Bias 2 Non-Linear Tasks Linear Regression wi not generaize we to the task beow

More information

(f) is called a nearly holomorphic modular form of weight k + 2r as in [5].

(f) is called a nearly holomorphic modular form of weight k + 2r as in [5]. PRODUCTS OF NEARLY HOLOMORPHIC EIGENFORMS JEFFREY BEYERL, KEVIN JAMES, CATHERINE TRENTACOSTE, AND HUI XUE Abstract. We prove that the product of two neary hoomorphic Hece eigenforms is again a Hece eigenform

More information

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SARAH DAY, JEAN-PHILIPPE LESSARD, AND KONSTANTIN MISCHAIKOW Abstract. One of the most efficient methods for determining the equiibria of a continuous parameterized

More information

Linear Complementarity Systems with Singleton Properties: Non-Zenoness

Linear Complementarity Systems with Singleton Properties: Non-Zenoness Proceedings of the 2007 American Contro Conference Marriott Marquis Hote at Times Square New York City, USA, Juy 11-13, 2007 ThA20.1 Linear Compementarity Systems with Singeton Properties: Non-Zenoness

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

Another Look at Linear Programming for Feature Selection via Methods of Regularization 1

Another Look at Linear Programming for Feature Selection via Methods of Regularization 1 Another Look at Linear Programming for Feature Seection via Methods of Reguarization Yonggang Yao, The Ohio State University Yoonkyung Lee, The Ohio State University Technica Report No. 800 November, 2007

More information

Convergence and quasi-optimality of adaptive finite element methods for harmonic forms

Convergence and quasi-optimality of adaptive finite element methods for harmonic forms Noname manuscript No. (wi be inserted by the editor) Convergence and quasi-optimaity of adaptive finite eement methods for harmonic forms Aan Demow 1 the date of receipt and acceptance shoud be inserted

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, as we as various

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

SVM: Terminology 1(6) SVM: Terminology 2(6)

SVM: Terminology 1(6) SVM: Terminology 2(6) Andrew Kusiak Inteigent Systems Laboratory 39 Seamans Center he University of Iowa Iowa City, IA 54-57 SVM he maxima margin cassifier is simiar to the perceptron: It aso assumes that the data points are

More information

Wavelet shrinkage estimators of Hilbert transform

Wavelet shrinkage estimators of Hilbert transform Journa of Approximation Theory 163 (2011) 652 662 www.esevier.com/ocate/jat Fu ength artice Waveet shrinkage estimators of Hibert transform Di-Rong Chen, Yao Zhao Department of Mathematics, LMIB, Beijing

More information

Research Article Numerical Range of Two Operators in Semi-Inner Product Spaces

Research Article Numerical Range of Two Operators in Semi-Inner Product Spaces Abstract and Appied Anaysis Voume 01, Artice ID 846396, 13 pages doi:10.1155/01/846396 Research Artice Numerica Range of Two Operators in Semi-Inner Product Spaces N. K. Sahu, 1 C. Nahak, 1 and S. Nanda

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, we as various

More information

u(x) s.t. px w x 0 Denote the solution to this problem by ˆx(p, x). In order to obtain ˆx we may simply solve the standard problem max x 0

u(x) s.t. px w x 0 Denote the solution to this problem by ˆx(p, x). In order to obtain ˆx we may simply solve the standard problem max x 0 Bocconi University PhD in Economics - Microeconomics I Prof M Messner Probem Set 4 - Soution Probem : If an individua has an endowment instead of a monetary income his weath depends on price eves In particuar,

More information

Tracking Control of Multiple Mobile Robots

Tracking Control of Multiple Mobile Robots Proceedings of the 2001 IEEE Internationa Conference on Robotics & Automation Seou, Korea May 21-26, 2001 Tracking Contro of Mutipe Mobie Robots A Case Study of Inter-Robot Coision-Free Probem Jurachart

More information

M. Aurada 1,M.Feischl 1, J. Kemetmüller 1,M.Page 1 and D. Praetorius 1

M. Aurada 1,M.Feischl 1, J. Kemetmüller 1,M.Page 1 and D. Praetorius 1 ESAIM: M2AN 47 (2013) 1207 1235 DOI: 10.1051/m2an/2013069 ESAIM: Mathematica Modeing and Numerica Anaysis www.esaim-m2an.org EACH H 1/2 STABLE PROJECTION YIELDS CONVERGENCE AND QUASI OPTIMALITY OF ADAPTIVE

More information

Optimization with stochastic preferences based on a general class of scalarization functions

Optimization with stochastic preferences based on a general class of scalarization functions Optimization with stochastic preferences based on a genera cass of scaarization functions Abstract: It is of crucia importance to deveop risk-averse modes for muticriteria decision making under uncertainty.

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Global Optimality Principles for Polynomial Optimization Problems over Box or Bivalent Constraints by Separable Polynomial Approximations

Global Optimality Principles for Polynomial Optimization Problems over Box or Bivalent Constraints by Separable Polynomial Approximations Goba Optimaity Principes for Poynomia Optimization Probems over Box or Bivaent Constraints by Separabe Poynomia Approximations V. Jeyakumar, G. Li and S. Srisatkunarajah Revised Version II: December 23,

More information

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction Akaike Information Criterion for ANOVA Mode with a Simpe Order Restriction Yu Inatsu * Department of Mathematics, Graduate Schoo of Science, Hiroshima University ABSTRACT In this paper, we consider Akaike

More information

pp in Backpropagation Convergence Via Deterministic Nonmonotone Perturbed O. L. Mangasarian & M. V. Solodov Madison, WI Abstract

pp in Backpropagation Convergence Via Deterministic Nonmonotone Perturbed O. L. Mangasarian & M. V. Solodov Madison, WI Abstract pp 383-390 in Advances in Neura Information Processing Systems 6 J.D. Cowan, G. Tesauro and J. Aspector (eds), Morgan Kaufmann Pubishers, San Francisco, CA, 1994 Backpropagation Convergence Via Deterministic

More information

c 2007 Society for Industrial and Applied Mathematics

c 2007 Society for Industrial and Applied Mathematics SIAM REVIEW Vo. 49,No. 1,pp. 111 1 c 7 Society for Industria and Appied Mathematics Domino Waves C. J. Efthimiou M. D. Johnson Abstract. Motivated by a proposa of Daykin [Probem 71-19*, SIAM Rev., 13 (1971),

More information

A Statistical Framework for Real-time Event Detection in Power Systems

A Statistical Framework for Real-time Event Detection in Power Systems 1 A Statistica Framework for Rea-time Event Detection in Power Systems Noan Uhrich, Tim Christman, Phiip Swisher, and Xichen Jiang Abstract A quickest change detection (QCD) agorithm is appied to the probem

More information

Statistical Learning Theory: a Primer

Statistical Learning Theory: a Primer ??,??, 1 6 (??) c?? Kuwer Academic Pubishers, Boston. Manufactured in The Netherands. Statistica Learning Theory: a Primer THEODOROS EVGENIOU AND MASSIMILIANO PONTIL Center for Bioogica and Computationa

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

A. Distribution of the test statistic

A. Distribution of the test statistic A. Distribution of the test statistic In the sequentia test, we first compute the test statistic from a mini-batch of size m. If a decision cannot be made with this statistic, we keep increasing the mini-batch

More information

Integrating Factor Methods as Exponential Integrators

Integrating Factor Methods as Exponential Integrators Integrating Factor Methods as Exponentia Integrators Borisav V. Minchev Department of Mathematica Science, NTNU, 7491 Trondheim, Norway Borko.Minchev@ii.uib.no Abstract. Recenty a ot of effort has been

More information

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with? Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine

More information

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm 1 Asymptotic Properties of a Generaized Cross Entropy Optimization Agorithm Zijun Wu, Michae Koonko, Institute for Appied Stochastics and Operations Research, Caustha Technica University Abstract The discrete

More information

Maximizing Sum Rate and Minimizing MSE on Multiuser Downlink: Optimality, Fast Algorithms and Equivalence via Max-min SIR

Maximizing Sum Rate and Minimizing MSE on Multiuser Downlink: Optimality, Fast Algorithms and Equivalence via Max-min SIR 1 Maximizing Sum Rate and Minimizing MSE on Mutiuser Downink: Optimaity, Fast Agorithms and Equivaence via Max-min SIR Chee Wei Tan 1,2, Mung Chiang 2 and R. Srikant 3 1 Caifornia Institute of Technoogy,

More information

Substructuring Preconditioners for the Bidomain Extracellular Potential Problem

Substructuring Preconditioners for the Bidomain Extracellular Potential Problem Substructuring Preconditioners for the Bidomain Extraceuar Potentia Probem Mico Pennacchio 1 and Vaeria Simoncini 2,1 1 IMATI - CNR, via Ferrata, 1, 27100 Pavia, Itay mico@imaticnrit 2 Dipartimento di

More information

Regularization for Nonlinear Complementarity Problems

Regularization for Nonlinear Complementarity Problems Int. Journa of Math. Anaysis, Vo. 3, 2009, no. 34, 1683-1691 Reguarization for Noninear Compementarity Probems Nguyen Buong a and Nguyen Thi Thuy Hoa b a Vietnamse Academy of Science and Technoogy Institute

More information

II. PROBLEM. A. Description. For the space of audio signals

II. PROBLEM. A. Description. For the space of audio signals CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time

More information

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems Consistent inguistic fuzzy preference reation with muti-granuar uncertain inguistic information for soving decision making probems Siti mnah Binti Mohd Ridzuan, and Daud Mohamad Citation: IP Conference

More information

BASIC NOTIONS AND RESULTS IN TOPOLOGY. 1. Metric spaces. Sets with finite diameter are called bounded sets. For x X and r > 0 the set

BASIC NOTIONS AND RESULTS IN TOPOLOGY. 1. Metric spaces. Sets with finite diameter are called bounded sets. For x X and r > 0 the set BASIC NOTIONS AND RESULTS IN TOPOLOGY 1. Metric spaces A metric on a set X is a map d : X X R + with the properties: d(x, y) 0 and d(x, y) = 0 x = y, d(x, y) = d(y, x), d(x, y) d(x, z) + d(z, y), for a

More information

FOURIER SERIES ON ANY INTERVAL

FOURIER SERIES ON ANY INTERVAL FOURIER SERIES ON ANY INTERVAL Overview We have spent considerabe time earning how to compute Fourier series for functions that have a period of 2p on the interva (-p,p). We have aso seen how Fourier series

More information

UNIFORM CONVERGENCE OF MULTIPLIER CONVERGENT SERIES

UNIFORM CONVERGENCE OF MULTIPLIER CONVERGENT SERIES royecciones Vo. 26, N o 1, pp. 27-35, May 2007. Universidad Catóica de Norte Antofagasta - Chie UNIFORM CONVERGENCE OF MULTILIER CONVERGENT SERIES CHARLES SWARTZ NEW MEXICO STATE UNIVERSITY Received :

More information

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

Preconditioned Locally Harmonic Residual Method for Computing Interior Eigenpairs of Certain Classes of Hermitian Matrices

Preconditioned Locally Harmonic Residual Method for Computing Interior Eigenpairs of Certain Classes of Hermitian Matrices MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.mer.com Preconditioned Locay Harmonic Residua Method for Computing Interior Eigenpairs of Certain Casses of Hermitian Matrices Vecharynski, E.; Knyazev,

More information