Another Look at Linear Programming for Feature Selection via Methods of Regularization

Yonggang Yao, The Ohio State University
Yoonkyung Lee, The Ohio State University

Technical Report No. 800, November 2007
Department of Statistics, The Ohio State University
1958 Neil Avenue, Columbus, OH

This is the first revision of Technical Report 800, Department of Statistics, The Ohio State University, July. Some numerical results in Section 6 have been added.

Summary. We consider statistical procedures for feature selection defined by a family of regularization problems with convex piecewise linear loss functions and penalties of $\ell_1$ nature. Many known statistical procedures (e.g. quantile regression and support vector machines with $\ell_1$-norm penalty) are subsumed under this category. Computationally, the regularization problems are linear programming (LP) problems indexed by a single parameter, which are known as parametric cost LP or parametric right-hand-side LP in the optimization theory. Exploiting the connection with the LP theory, we lay out general algorithms, namely, the simplex algorithm and its variant, for generating regularized solution paths for the feature selection problems. The significance of such algorithms is that they allow a complete exploration of the model space along the paths and provide a broad view of persistent features in the data. The implications of the general path-finding algorithms are outlined for a few statistical procedures, and they are illustrated with numerical examples.

Keywords: $\ell_1$-norm penalty; Parametric linear programming; Quantile regression; Simplex method; Structured learning; Support vector machines.

1. Introduction

Regularization methods cover a wide range of statistical procedures for estimation and prediction, and they have been used in many modern applications. To name a few, examples are ridge regression (Hoerl and Kennard, 1970), the LASSO regression (Tibshirani, 1996), smoothing splines (Wahba, 1990), and support vector machines (SVM) (Vapnik, 1998). Given a training data set, $\{(y_i, x_i) : x_i \in \mathbb{R}^p;\ i = 1, \dots, n\}$, many statistical problems can be phrased as the problem of finding a functional relationship between the covariates, $x \in \mathbb{R}^p$, and the response $y$ based on the observed pairs.
For example, a regularization method for prediction looks for a model $f(x; \beta)$ with unknown parameters $\beta$ that minimizes a prediction error over the training data while controlling its model complexity. To be precise, let $L(y, f(x; \beta))$ be a convex loss function for the prediction error and $J(f(x; \beta))$ be a convex penalty functional that measures the model complexity. The training error with respect to $L$ is defined by $L(Y, f(X; \beta)) := \sum_{i=1}^{n} L(y_i, f(x_i; \beta))$, where $Y := (y_1, \dots, y_n)^\top$ and $X := (x_1^\top, \dots, x_n^\top)^\top$. Formally, the solution to a regularization problem is defined to be $f$ with the model parameters $\hat\beta$ that minimize

$$L(Y, f) + \lambda \cdot J(f), \tag{1}$$

where $\lambda \ge 0$ is a pre-specified regularization parameter. The $\lambda$ determines the trade-off between the prediction error and the model complexity, and thus the quality of the solution highly depends on the choice of $\lambda$. Identification of a proper value of the regularization parameter for model selection or a proper range for model averaging is a critical statistical problem. Note that $\hat\beta(\lambda)$ is a function of $\lambda$. As in (1), each regularization method defines a continuum of optimization problems indexed by a tuning parameter. In most cases, the solution as a function of the tuning parameter is expected to change continuously with $\lambda$. This allows for the possibility of complete exploration of the model space as $\lambda$ varies, and computational savings if (1) is to be optimized for multiple values of $\lambda$.

Alternatively, the regularization problem in (1) can be formulated to bound the model complexity or the penalty. In this complexity-bounded formulation, the optimal parameters are sought by minimizing

$$L(Y, f) \quad \text{s.t.} \quad J(f) \le s, \tag{2}$$

where $s$ is an upper bound of the complexity.

Address for correspondence: Yoonkyung Lee, Department of Statistics, The Ohio State University, 1958 Neil Ave, Columbus, OH 43210, USA. E-mail: ykee@stat.osu.edu

For a certain combination of the loss $L$ and the complexity measure $J$, it is feasible to generate the entire solution path of the regularization problem. Here, the path refers to the entire set of solutions to the regularization problem, for instance, $\hat\beta(\lambda)$ in (1) as a function of $\lambda$ (or $\hat\beta(s)$ in (2) as a function of $s$). Some pairs of the loss and the complexity are known to allow such fast and efficient path-finding algorithms; for instance, LARS (Efron et al., 2004), the standard binary SVM (Hastie et al., 2004), the multi-category SVM (Lee and Cui, 2006), and the $\ell_1$-norm quantile regression (Li and Zhu, 2005). Rosset and Zhu (2007) study general conditions for the combination of $L$ and $J$ such that solutions indexed by a regularization parameter are piecewise linear and thus can be sequentially characterized. They provide generic path-finding algorithms under some appropriate assumptions on $L$ and $J$.

In this paper, we focus on an array of regularization methods aimed at feature selection with penalties of $\ell_1$ nature and piecewise linear loss functions. Many existing procedures are subsumed under this category. Examples include the $\ell_1$-norm SVM (Zhu et al., 2004) and its extension to the multi-class case (Wang and Shen, 2006), $\ell_1$-norm quantile regression (Li and Zhu, 2005), Sup-norm multi-category SVM (Zhang et al., 2006), the functional component selection step (called the $\theta$-step) for structured multi-category SVM (Lee et al., 2006), and the Dantzig selector (Candes and Tao, 2005). We also note that the $\epsilon$-insensitive loss in the SVM regression (Vapnik, 1998) fits into the category of a piecewise linear loss, and the sup norm gives rise to a linear penalty just as the $\ell_1$ norm in general. There is a great commonality among these methods. That is, computationally the associated optimization problems are all linear programming (LP) problems indexed by a single regularization parameter.
This family of LP problems is known as parametric cost linear programming and has long been studied in the optimization theory. Furthermore, there already exist efficient algorithms for the solution paths. Despite the commonality, so far, only case-by-case treatments of some of the problems are available, as in Zhu et al. (2004), Li and Zhu (2005) and Wang and Shen (2006). Although Wang and Shen (2006) notice that those solution path algorithms have fundamental connections with the parametric right-hand-side LP (see (8) for the definition), such connections have not been adequately explored for other problems with generality. As noted, Rosset and Zhu (2007) have a comprehensive take on the computational properties of regularized solutions; however, they did not tap into the LP theory for general treatments of the problems of our focus.

The goal of this paper is to make more explicit the link between the parametric LP and a family of computational problems arising in statistics for feature selection via regularization, and to put those feature selection problems in perspective. To this end, we pull together results from the linear programming literature and summarize them in an accessible and self-contained fashion.

Section 2 begins with an overview of the standard LP and parametric LP problems, and gives a brief account of the optimality conditions for their solutions. Section 3 introduces the simplex algorithm and the tableau-simplex algorithm for finding the entire solution paths of the parametric LP problems. Section 4 describes a few examples of LP for feature selection, paraphrasing their computational elements in the LP terms. A detailed comparison of the simplex algorithm with the existing algorithm for the $\ell_1$-norm SVM (Zhu et al., 2004) is given in Section 5, highlighting the generality of the proposed approach. Numerical examples and a data application of the algorithm follow in Section 6 for illustration. Technical proofs except for the key theorems are collected in the Appendix.

2. Linear Programming

Linear programming (LP) is one of the cornerstones of the optimization theory. Since the publication of the simplex algorithm by Dantzig in 1947, LP has quickly found its applications in operations research, microeconomics, business management, and other engineering fields. We give an overview of LP here and describe the optimality conditions of the LP solution pertinent to our discussion of path-finding algorithms. Some properties of the LP to be described are well known in the optimization literature, but we include them and their proofs for completeness. Our treatment of LP closely follows that in standard references such as Bertsimas and Tsitsiklis (1997) and Murty (1983). The readers are referred to them and references therein for more complete discussions.

2.1. Standard Linear Programs

A standard form of LP is

$$\min_{z \in \mathbb{R}^N} c^\top z \quad \text{s.t.} \quad Az = b,\quad z \ge 0, \tag{3}$$

where $z$ is an $N$-vector of variables, $c$ is a fixed $N$-vector, $b$ is a fixed $M$-vector, and $A$ is an $M \times N$ fixed matrix. Without loss of generality, it is assumed that $M \le N$ and $A$ is of full rank. Standard techniques for solving LP include the simplex method, dual simplex method, tableau method, and interior point methods.

Geometrically speaking, the standard LP problem in (3) searches for the minimum of a linear function over a polyhedron whose edges are defined by hyperplanes. Therefore, if there exists a finite solution for the LP problem, at least one of the intersection points (formally called basic solutions) of the hyperplanes should attain the minimum. For formal discussion of the optimality, a brief review of some terminology in LP is provided. Let $\mathcal{N}$ denote the index set $\{1, \dots, N\}$ of the unknowns, $z$, in the LP problem in (3).

DEFINITION 1. A set $B^* := \{B^*_1, \dots, B^*_M\} \subset \mathcal{N}$ is called a basic index set if $A_{B^*} := [A_{B^*_1}, \dots, A_{B^*_M}]$ is invertible, where $A_{B^*_i}$ is the $B^*_i$th column vector of $A$ for $B^*_i \in B^*$. $A_{B^*}$ is called the basic matrix associated with $B^*$. Correspondingly, a vector $z^* \in \mathbb{R}^N$ is called the basic solution associated with $B^*$ if $z^*$ satisfies

$$z^*_{B^*} := (z^*_{B^*_1}, \dots, z^*_{B^*_M})^\top = A_{B^*}^{-1} b, \qquad z^*_j = 0 \ \text{ for } j \in \mathcal{N} \setminus B^*.$$

DEFINITION 2. Let $z^*$ be the basic solution associated with $B^*$.
- $z^*$ is called a basic feasible solution if $z^*_{B^*} \ge 0$;
- $z^*$ is called a non-degenerate basic feasible solution if $z^*_{B^*} > 0$;
- $z^*$ is called a degenerate basic feasible solution if $z^*_{B^*} \ge 0$ and $z^*_{B^*_i} = 0$ for some $i \in \mathcal{M} := \{1, \dots, M\}$;
- $z^*$ is called an optimal basic solution if $z^*$ is a solution of the LP problem.

Since each basic solution is associated with its basic index set, the optimal basic solution can be identified with the optimal basic index set as defined below.

DEFINITION 3. A basic index set $B^*$ is called a feasible basic index set if $A_{B^*}^{-1} b \ge 0$. A feasible basic index set $B^*$ is also called an optimal basic index set if

$$c - A^\top (A_{B^*}^{-1})^\top c_{B^*} \ge 0.$$

The following theorem indicates that the standard LP problem can be solved by finding the optimal basic index set.

THEOREM 4. For the LP problem in (3), let $z^*$ be the basic solution associated with $B^*$, an optimal basic index set. Then $z^*$ is an optimal basic solution.

PROOF. We need to show $c^\top z \ge c^\top z^*$, or $c^\top (z - z^*) \ge 0$, for any feasible vector $z \in \mathbb{R}^N$ with $Az = b$ and $z \ge 0$. Denote $d := (d_1, \dots, d_N)^\top := z - z^*$. From

$$Ad = A_{B^*} d_{B^*} + \sum_{i \in \mathcal{N} \setminus B^*} A_i d_i = 0,$$

we have $d_{B^*} = -\sum_{i \in \mathcal{N} \setminus B^*} A_{B^*}^{-1} A_i d_i$. Then,

$$c^\top (z - z^*) = c^\top d = c_{B^*}^\top d_{B^*} + \sum_{i \in \mathcal{N} \setminus B^*} c_i d_i = \sum_{i \in \mathcal{N} \setminus B^*} \left(c_i - c_{B^*}^\top A_{B^*}^{-1} A_i\right) d_i.$$

Recall that for $i \in \mathcal{N} \setminus B^*$, $z^*_i = 0$, which implies $d_i := z_i - z^*_i \ge 0$. Together with $c - A^\top (A_{B^*}^{-1})^\top c_{B^*} \ge 0$, it ensures $(c_i - c_{B^*}^\top A_{B^*}^{-1} A_i) d_i \ge 0$. Thus, we have $c^\top d \ge 0$.

2.2. Parametric Linear Programs

In practical applications, the cost coefficients $c$ or the constraint constants $b$ in (3) are often partially known or controllable so that they may be modeled linearly as $(c + \lambda a)$ or $(b + \omega b^*)$ with some parameters $\lambda$ and $\omega \in \mathbb{R}$. A family of regularization methods for feature selection to be discussed share this characteristic. Although every parameter value creates a new LP problem in the setting, it is feasible to generate solutions for all values of the parameter via sequential updates. The new LP problems indexed by the parameters are called the parametric-cost LP and parametric right-hand-side LP, respectively. For reference, see Bertsimas and Tsitsiklis (1997) and Murty (1983).

The standard form of a parametric-cost LP is defined as

$$\min_{z \in \mathbb{R}^N} (c + \lambda a)^\top z \quad \text{s.t.} \quad Az = b,\quad z \ge 0. \tag{4}$$

Since the basic index sets of the parametric-cost LP do not depend on the parameter $\lambda$, it is not hard to see that an optimal basic index set $B^*$ for some fixed value of $\lambda$ would remain optimal for a range of $\lambda$ values, say, $[\underline{\lambda}, \overline{\lambda}]$. The interval is called the optimality interval of $B^*$ for the parametric-cost LP problem.

COROLLARY 5. For a fixed $\lambda^* \ge 0$, let $B^*$ be an optimal basic index set of the problem in (4) at $\lambda = \lambda^*$. Define

$$\underline{\lambda} := \max_{\{j :\ \check{a}^*_j > 0;\ j \in \mathcal{N} \setminus B^*\}} \left(-\frac{\check{c}^*_j}{\check{a}^*_j}\right) \quad \text{and} \quad \overline{\lambda} := \min_{\{j :\ \check{a}^*_j < 0;\ j \in \mathcal{N} \setminus B^*\}} \left(-\frac{\check{c}^*_j}{\check{a}^*_j}\right), \tag{5}$$

where $\check{a}^*_j := a_j - a_{B^*}^\top A_{B^*}^{-1} A_j$ and $\check{c}^*_j := c_j - c_{B^*}^\top A_{B^*}^{-1} A_j$ for $j \in \mathcal{N}$. Then, $B^*$ is an optimal basic index set of (4) for $\lambda \in [\underline{\lambda}, \overline{\lambda}]$, which includes $\lambda^*$.

PROOF. From the optimality of $B^*$ for $\lambda = \lambda^*$, we have $A_{B^*}^{-1} b \ge 0$ and

$$\left[c - A^\top (A_{B^*}^{-1})^\top c_{B^*}\right] + \lambda^* \left[a - A^\top (A_{B^*}^{-1})^\top a_{B^*}\right] \ge 0,$$

which implies that $\check{c}^*_j + \lambda^* \check{a}^*_j \ge 0$ for $j \in \mathcal{N}$. To find the optimality interval $[\underline{\lambda}, \overline{\lambda}]$ of $B^*$, by Theorem 4, we need to investigate the following inequality for each $j \in \mathcal{N}$:

$$\check{c}^*_j + \lambda \check{a}^*_j \ge 0. \tag{6}$$

It is easy to see that $A_{B^*}^{-1} A_{B^*_i} = e_i$ for $i \in \mathcal{M}$ since $A_{B^*_i}$ is the $i$th column of $A_{B^*}$. Consequently, the $j$th entries of $(c^\top - c_{B^*}^\top A_{B^*}^{-1} A)$ and $(a^\top - a_{B^*}^\top A_{B^*}^{-1} A)$ are both 0 for $j \in B^*$, and $\check{c}^*_j + \lambda \check{a}^*_j = 0$ for any $\lambda$. So, the inequality holds for any $\lambda \in \mathbb{R}$ and $j \in B^*$. When $\check{a}^*_j > 0$ (or $\check{a}^*_j < 0$) for $j \in \mathcal{N} \setminus B^*$, (6) holds if and only if $\lambda \ge -\check{c}^*_j / \check{a}^*_j$ (or $\lambda \le -\check{c}^*_j / \check{a}^*_j$). Thus, the lower bound and the upper bound of the optimality interval of $B^*$ are given by the $\underline{\lambda}$ and $\overline{\lambda}$ in (5).
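The bounds in (5) are simple to compute once the reduced coefficients are available. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names `solve`, `reduced`, and `optimality_interval`, the 0-based indexing, and the toy problem (a single constraint $z_1 + z_2 + z_3 = 1$ with $c = (0, 1, 3)$ and $a = (2, 1, 0)$) are all assumptions made for illustration. The basis passed in is assumed to be feasible and optimal for some $\lambda^*$.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve the square system M x = rhs exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def reduced(coef, A, basis):
    """Return coef_j - coef_B' A_B^{-1} A_j for every column j (0-based)."""
    m = len(A)
    # y solves A_B' y = coef_B, so that y' A_j = coef_B' A_B^{-1} A_j.
    AB_T = [[A[r][j] for r in range(m)] for j in basis]
    y = solve(AB_T, [coef[j] for j in basis])
    return [coef[j] - sum(y[r] * A[r][j] for r in range(m)) for j in range(len(coef))]


def optimality_interval(A, c, a, basis):
    """Lower and upper bounds of (5) for a basis assumed optimal at some lambda."""
    c_chk, a_chk = reduced(c, A, basis), reduced(a, A, basis)
    nonbasic = [j for j in range(len(c)) if j not in basis]
    lows = [-c_chk[j] / a_chk[j] for j in nonbasic if a_chk[j] > 0]
    highs = [-c_chk[j] / a_chk[j] for j in nonbasic if a_chk[j] < 0]
    lower = max(lows) if lows else float("-inf")
    upper = min(highs) if highs else float("inf")
    return lower, upper


# Toy problem: the cost of the three unit vertices is (2*lam, 1 + lam, 3).
A, c, a = [[1, 1, 1]], [0, 1, 3], [2, 1, 0]
print(optimality_interval(A, c, a, [1]))  # basis {z_2}
```

In the toy problem the second variable has the smallest cost $1 + \lambda$ exactly when $1 \le \lambda \le 2$, which is the interval the function returns for that basis.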

Note that $\check{c}^*_j$ and $\check{a}^*_j$ define the relative cost coefficient of $z_j$. Since the number of basic index sets is finite for fixed $A$, there exist only a finite number of optimal basic index sets of the problem in (4). Corollary 5 implies that a version of the solution path of the problem as a function of $\lambda$, $z(\lambda)$, is a step function.

On the other hand, if the parametric cost LP in (4) is recast in the form of (2), then the property of the solution path changes. The alternative complexity-bounded formulation of (4) is given by

$$\min_{z \in \mathbb{R}^N,\ \delta \in \mathbb{R}} c^\top z \quad \text{s.t.} \quad Az = b,\quad a^\top z + \delta = s,\quad z \ge 0,\ \delta \ge 0. \tag{7}$$

It can be transformed into a standard parametric right-hand-side LP problem:

$$\min_{\tilde{z} \in \mathbb{R}^{N+1}} \tilde{c}^\top \tilde{z} \quad \text{s.t.} \quad \tilde{A} \tilde{z} = \tilde{b} + \omega \tilde{b}^*,\quad \tilde{z} \ge 0 \tag{8}$$

by setting

$$\omega = s, \quad \tilde{z} = \begin{bmatrix} z \\ \delta \end{bmatrix}, \quad \tilde{c} = \begin{bmatrix} c \\ 0 \end{bmatrix}, \quad \tilde{b} = \begin{bmatrix} b \\ 0 \end{bmatrix}, \quad \tilde{b}^* = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad \text{and} \quad \tilde{A} = \begin{bmatrix} A & 0 \\ a^\top & 1 \end{bmatrix}.$$

Note that when $A$ is of full rank, so is $\tilde{A}$ in (8). Let $B^*$ be an optimal basic index set of (8) at $\omega = \omega^*$. Similarly, we can show that $B^*$ is optimal for any $\omega$ satisfying $\tilde{z}_{B^*} = \tilde{A}_{B^*}^{-1} (\tilde{b} + \omega \tilde{b}^*) \ge 0$, and there exist $\underline{\omega}$ and $\overline{\omega}$ such that $B^*$ is optimal for $\omega \in [\underline{\omega}, \overline{\omega}]$. This implies that a version of the solution path of (8) is a piecewise linear function.

3. Generating the Solution Path

Based on the basic concepts and the optimality condition of LP introduced in Section 2, we describe algorithms to generate the solution paths for (4) and (7), namely, the simplex and tableau-simplex algorithms. Since the examples of the LP problem in Section 4 for feature selection involve non-negative $a$, $\lambda$, and $s$ only, we assume that they are non-negative in the following algorithms and take $s = 0$ (equivalently $\lambda = \infty$) as a starting value.

3.1. Simplex Algorithm

3.1.1. Initialization

Let $z^0 := (z^0_1, \dots, z^0_N)^\top$ denote the initial solution of (7) at $s = 0$. Then $a^\top z^0 = 0$ implies $z^0_j = 0$ for all $j \notin I_a := \{i : a_i = 0,\ i \in \mathcal{N}\}$. Thus, by extracting the coordinates of $c$, $z$, and the columns in $A$ corresponding to $I_a$, we can simplify the initial LP problem of (4) and (7) to

$$\min_{z_{I_a} \in \mathbb{R}^{|I_a|}} c_{I_a}^\top z_{I_a} \quad \text{s.t.} \quad A_{I_a} z_{I_a} = b,\quad z_{I_a} \ge 0, \tag{9}$$

where $|I_a|$ is the cardinality of $I_a$.
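As a small illustration of the reduction to (9), the sketch below extracts the zero-penalty coordinates from a toy problem. The function name `reduce_to_zero_penalty` and the data are hypothetical, and indices are 0-based, unlike the 1-based indexing of the text.

```python
def reduce_to_zero_penalty(A, c, a):
    """Keep only the coordinates j with a_j = 0, as in the reduced problem (9)."""
    I_a = [j for j, aj in enumerate(a) if aj == 0]
    A_red = [[row[j] for j in I_a] for row in A]
    c_red = [c[j] for j in I_a]
    return I_a, A_red, c_red


# Toy data: one equality constraint, three variables, only the last is unpenalized.
I_a, A_red, c_red = reduce_to_zero_penalty([[1, 1, 1]], [0, 1, 3], [2, 1, 0])
print(I_a, A_red, c_red)
```

Solving the reduced problem with any standard LP method then yields the starting basis for the path.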
Accordingly, any initial optimal basic index set $B^0$ of (4) and (7) contains that of the reduced problem (9) and determines the initial solution $z^0$.

3.1.2. Main Algorithm

For simplicity, we describe the algorithm for the solution path of the parametric-cost LP problem in (4) first, and then discuss how it also solves the complexity-bounded LP problem in (7).

Let $B^l$ be the $l$th optimal basic index set at $\lambda = \lambda_{l-1}$. For convenience, define $\lambda_{-1} := \infty$, the starting value of the regularization parameter for the solution path of (4). Given $B^l$, let $z^l$ be the $l$th joint solution, which is given by $z^l_{B^l} = A_{B^l}^{-1} b$ and $z^l_j = 0$ for $j \in \mathcal{N} \setminus B^l$. Since the optimal LP solution is identified by the optimal basic index set as in Theorem 4, it suffices to describe how to update the optimal basic index set as $\lambda$ decreases. By the invertibility of the basic matrix associated with the index set, updating amounts to finding a new index that enters and another that exits the current basic index set.

By Corollary 5, we can compute the lower bound of the optimality interval of $B^l$, denoted by $\lambda_l$, and identify the entry index associated with it. Let

$$j^l := \arg\max_{\{j :\ \check{a}^l_j > 0;\ j \in \mathcal{N} \setminus B^l\}} \left(-\frac{\check{c}^l_j}{\check{a}^l_j}\right), \tag{10}$$

where $\check{a}^l_j := a_j - a_{B^l}^\top A_{B^l}^{-1} A_j$ and $\check{c}^l_j := c_j - c_{B^l}^\top A_{B^l}^{-1} A_j$. Then, the lower bound is given by $\lambda_l := -\check{c}^l_{j^l} / \check{a}^l_{j^l}$, and $B^l$ is optimal for $\lambda \in [\lambda_l, \lambda_{l-1}]$.

To determine the index exiting $B^l$, consider the moving direction from $z^l$ to the next joint solution. Define $d^l := (d^l_1, \dots, d^l_N)^\top$ as

$$d^l_{B^l} = -A_{B^l}^{-1} A_{j^l}, \quad d^l_{j^l} = 1, \quad \text{and} \quad d^l_i = 0 \ \text{ for } i \in \mathcal{N} \setminus (B^l \cup \{j^l\}). \tag{11}$$

Lemma 12 in the Appendix shows that $d^l$ is the moving direction at $\lambda = \lambda_l$ in the sense that $z^{l+1} = z^l + \tau d^l$ for some $\tau \ge 0$. For the feasibility of $z^{l+1} \ge 0$, the step size $\tau$ cannot exceed the minimum of $-z^l_i / d^l_i$ for $i \in B^l$ with $d^l_i < 0$, and the index attaining the minimum is to leave $B^l$. Denote the exit index by

$$i^l := \arg\min_{i \in \{j :\ d^l_j < 0,\ j \in B^l\}} \left(-\frac{z^l_i}{d^l_i}\right). \tag{12}$$

Therefore, the optimal basic index set at $\lambda = \lambda_l$ is given by $B^{l+1} := B^l \cup \{j^l\} \setminus \{i^l\}$. More precisely, we can verify the optimality of $B^{l+1}$ at $\lambda = \lambda_l$ by showing that

$$(c + \lambda_l a) - A^\top (A_{B^{l+1}}^{-1})^\top (c_{B^{l+1}} + \lambda_l a_{B^{l+1}}) = (c + \lambda_l a) - A^\top (A_{B^l}^{-1})^\top (c_{B^l} + \lambda_l a_{B^l}). \tag{13}$$

The proof is given in Appendix B. Then the fact that $B^l$ is optimal at $\lambda = \lambda_l$ implies that $B^{l+1}$ is also optimal at $\lambda = \lambda_l$. As a result, the updating procedure can be repeated with $B^{l+1}$ and $\lambda_l$ successively until $\lambda_l \le 0$ or, equivalently, $\check{c}^l_{j^l} \ge 0$. The algorithm for updating the optimal basic index sets is summarized as follows.

1. Initialize the optimal basic index set at $\lambda_{-1} = \infty$ with $B^0$.
2. Given $B^l$, the $l$th optimal basic index set at $\lambda = \lambda_{l-1}$, determine the solution $z^l$ by $z^l_{B^l} = A_{B^l}^{-1} b$ and $z^l_j = 0$ for $j \in \mathcal{N} \setminus B^l$.
3. Find the entry index $j^l = \arg\max_{\{j :\ \check{a}^l_j > 0;\ j \in \mathcal{N} \setminus B^l\}} \left(-\check{c}^l_j / \check{a}^l_j\right)$.
4. Find the exit index $i^l = \arg\min_{i \in \{j :\ d^l_j < 0,\ j \in B^l\}} \left(-z^l_i / d^l_i\right)$. If there are multiple indices, choose one of them.
5. Update the optimal basic index set to $B^{l+1} = B^l \cup \{j^l\} \setminus \{i^l\}$.
6. Terminate the algorithm if $\check{c}^l_{j^l} \ge 0$ or, equivalently, $\lambda_l \le 0$. Otherwise, repeat Steps 2 to 5.

If $-z^l_{i^l} / d^l_{i^l} = 0$, then $z^l = z^{l+1}$, which may result in the problem of cycling among several basic index sets with the same solution. We defer the description of the tableau-simplex algorithm, which can avoid the cycling problem, to Section 3.2. For brevity, we just assume that $z^l + \tau d^l \ge 0$ for some $\tau > 0$ so that $z^l \ne z^{l+1}$ for each $l$, and call this the non-degeneracy assumption. Under this assumption, suppose the simplex algorithm terminates after $J$ iterations with $\{(z^l, \lambda_l) : l = 0, 1, \dots, J\}$. Then the entire solution path is obtained as described below.
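The updating steps summarized above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `solve`, `reduced`, and `simplex_path` are hypothetical, indices are 0-based, the caller supplies an initial basis that is optimal as $\lambda \to \infty$, the non-degeneracy assumption is taken for granted (no anti-cycling safeguard), and the loop stops as soon as the next joint $\lambda_l$ is non-positive or no entry index exists.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve the square system M x = rhs exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def reduced(coef, A, AB, basis):
    """Return coef_j - coef_B' A_B^{-1} A_j for every column j."""
    m = len(A)
    y = solve([list(col) for col in zip(*AB)], [coef[j] for j in basis])
    return [coef[j] - sum(y[r] * A[r][j] for r in range(m)) for j in range(len(coef))]


def simplex_path(A, b, c, a, basis):
    """Return the joints [(z^l, lambda_l)] of the parametric-cost LP path."""
    m, n = len(A), len(c)
    basis, joints = list(basis), []
    while True:
        AB = [[A[r][j] for j in basis] for r in range(m)]
        zB = solve(AB, b)                      # basic part of z^l
        z = [Fraction(0)] * n
        for k, j in enumerate(basis):
            z[j] = zB[k]
        c_chk, a_chk = reduced(c, A, AB, basis), reduced(a, A, AB, basis)
        cand = [j for j in range(n) if j not in basis and a_chk[j] > 0]
        if not cand:                           # basis stays optimal for all smaller lambda
            joints.append((z, float("-inf")))
            return joints
        j_in = max(cand, key=lambda j: -c_chk[j] / a_chk[j])   # entry index, cf. (10)
        lam = -c_chk[j_in] / a_chk[j_in]
        joints.append((z, lam))
        if lam <= 0:
            return joints
        u = solve(AB, [A[r][j_in] for r in range(m)])          # u = -d_B, cf. (11)
        rows = [k for k in range(m) if u[k] > 0]
        if not rows:
            raise ValueError("no exit index: the LP is unbounded along d")
        k_out = min(rows, key=lambda k: zB[k] / u[k])          # ratio test, cf. (12)
        basis[k_out] = j_in


# Toy problem: z_1 + z_2 + z_3 = 1 with vertex costs (2*lam, 1 + lam, 3).
for z, lam in simplex_path([[1, 1, 1]], [1], [0, 1, 3], [2, 1, 0], [2]):
    print([str(v) for v in z], lam)
```

On the toy problem the joints occur at $\lambda = 2$ and $\lambda = 1$, where the cheapest vertex switches from the third variable to the second and then to the first.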

THEOREM 6. The solution path of (4) is

$$\begin{cases} z^0 & \text{for } \lambda > \lambda_0 \\ z^l & \text{for } \lambda_l < \lambda < \lambda_{l-1},\ l = 1, \dots, J \\ \tau z^l + (1 - \tau) z^{l+1} & \text{for } \lambda = \lambda_l \text{ and } \tau \in [0, 1],\ l = 0, \dots, J - 1. \end{cases} \tag{14}$$

Likewise, the solutions to the alternative formulation of (7) with the complexity bound can be obtained as a function of $s$. By the correspondence of the two formulations, the $l$th joint of the piecewise linear solution is given by $s_l = a^\top z^l$, and the solution between the joints is a linear combination of $z^l$ and $z^{l+1}$ as described in Theorem 7 below. Its proof is in Appendix C.

THEOREM 7. For $s \ge 0$, the solution path for (7) can be expressed as

$$\begin{cases} \dfrac{s_{l+1} - s}{s_{l+1} - s_l}\, z^l + \dfrac{s - s_l}{s_{l+1} - s_l}\, z^{l+1} & \text{if } s_l \le s < s_{l+1} \text{ and } l = 0, \dots, J - 1 \\ z^J & \text{if } s \ge s_J. \end{cases}$$

3.2. Tableau-Simplex Algorithm

The non-degeneracy assumption in the simplex method that any two consecutive joint solutions are different may not hold in practice for many problems. When some columns of a basic matrix are discrete, the assumption may fail at some degenerate joint solutions. To deal with more general settings where the cycling problem may occur in generating the LP solution path by the simplex method, we discuss the tableau-simplex algorithm.

A tableau is a big matrix which contains all the information about the LP. It consists of the relevant terms in LP associated with a basic matrix, such as the basic solution and the cost.

DEFINITION 8. For a basic index set $B^*$, its tableau is defined as

$$\begin{array}{l|c|c} & \text{zeroth column} & \text{pivot columns} \\ \hline \text{cost row} & -c_{B^*}^\top A_{B^*}^{-1} b & c^\top - c_{B^*}^\top A_{B^*}^{-1} A \\ \text{penalty row} & -a_{B^*}^\top A_{B^*}^{-1} b & a^\top - a_{B^*}^\top A_{B^*}^{-1} A \\ \text{pivot rows} & A_{B^*}^{-1} b & A_{B^*}^{-1} A \end{array}$$

We follow the convention for the names of the columns and rows in the tableau. For reference, see Murty (1983) and Bertsimas and Tsitsiklis (1997). Note that the zeroth column contains $z^*_{B^*} := A_{B^*}^{-1} b$, the non-zero part of the basic solution, $-c_{B^*}^\top A_{B^*}^{-1} b = -c^\top z^*$, the negative cost, and $-a_{B^*}^\top A_{B^*}^{-1} b = -a^\top z^*$, the negative penalty of $z^*$ associated with $B^*$, and the pivot columns contain the $\check{c}^*_j$'s and $\check{a}^*_j$'s. The algorithm to be discussed updates the basic index sets by using the tableau, in particular, by ordering some rows of the tableau.
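The tableau of Definition 8 can be assembled directly from $A$, $b$, $c$, $a$, and a basis. The sketch below is one possible layout, assumed for illustration: row 0 is the cost row, row 1 the penalty row, the remaining rows are the pivot rows, and column 0 is the zeroth column. The function names and the toy data are hypothetical, and indices are 0-based.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve the square system M x = rhs exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def tableau(A, b, c, a, basis):
    """Rows: cost, penalty, then pivot rows; column 0 is the zeroth column."""
    m, n = len(A), len(c)
    AB = [[A[r][j] for j in basis] for r in range(m)]
    # Columns of A_B^{-1} [b | A], one right-hand side at a time.
    cols = [solve(AB, b)] + [solve(AB, [A[r][j] for r in range(m)]) for j in range(n)]
    pivot_rows = [[col[k] for col in cols] for k in range(m)]

    def top_row(coef):
        coef_B = [coef[j] for j in basis]
        head = -sum(w * x for w, x in zip(coef_B, cols[0]))
        rest = [coef[j] - sum(w * x for w, x in zip(coef_B, cols[j + 1])) for j in range(n)]
        return [head] + rest

    return [top_row(c), top_row(a)] + pivot_rows


# Toy problem with basis {z_2} (index 1).
for row in tableau([[1, 1, 1]], [1], [0, 1, 3], [2, 1, 0], [1]):
    print([str(v) for v in row])
```

For this basis the basic solution is $z = (0, 1, 0)$ with cost 1 and penalty 1, so the zeroth column of the cost and penalty rows both hold $-1$.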
To describe the algorithm, we introduce the lexicographic order of vectors first.

DEFINITION 9. For $v$ and $w \in \mathbb{R}^n$, we say that $v$ is lexicographically greater than $w$ (denoted by $v \overset{L}{>} w$) if the first non-zero entry of $v - w$ is strictly positive. We say that $v$ is lexicographically positive if $v \overset{L}{>} 0$.

Consider the parametric-cost LP in (4).

3.2.1. Initial Tableau

With the index set $B^0$, initialize the tableau. Since $z^0_{B^0} = A_{B^0}^{-1} b \ge 0$ and the columns of $A$ can be rearranged such that the first $M$ columns of $A_{B^0}^{-1} A$ are $I$, we assume that the pivot rows, $[A_{B^0}^{-1} b \ \ A_{B^0}^{-1} A]$, of the initial tableau are lexicographically positive. In other words, there is a permutation $\pi : \mathcal{N} \to \mathcal{N}$ which maps $B^0$ to $\mathcal{M} := \{1, \dots, M\}$, and we can replace the problem with the $\pi$-permuted version (e.g., $z_{\pi(\mathcal{N})}$ and $A_{\pi(\mathcal{N})}$).
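Definition 9 translates into a short comparison routine. The following sketch uses hypothetical function names `lex_greater` and `lex_positive` and assumes both vectors have the same length.

```python
def lex_greater(v, w):
    """True if the first non-zero entry of v - w is strictly positive."""
    for x, y in zip(v, w):
        if x != y:
            return x > y
    return False  # v == w, so v is not lexicographically greater


def lex_positive(v):
    """True if v is lexicographically greater than the zero vector."""
    return lex_greater(v, [0] * len(v))


print(lex_greater([0, 2, -5], [0, 1, 9]))  # decided by the second entry
print(lex_positive([0, 0, 3, -1]))         # first non-zero entry is 3
print(lex_positive([0, -1, 5]))            # first non-zero entry is -1
```

Python's built-in ordering of lists and tuples is also lexicographic, so `v > w` gives the same answer for equal-length sequences of numbers.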

3.2.2. Updating Tableau

Given the current optimal basic index set $B^l$, the current tableau is

$$\begin{array}{l|c|c} & \text{zeroth column} & \text{pivot columns} \\ \hline \text{cost row} & -c_{B^l}^\top A_{B^l}^{-1} b & c^\top - c_{B^l}^\top A_{B^l}^{-1} A \\ \text{penalty row} & -a_{B^l}^\top A_{B^l}^{-1} b & a^\top - a_{B^l}^\top A_{B^l}^{-1} A \\ \text{pivot rows} & A_{B^l}^{-1} b & A_{B^l}^{-1} A \end{array}$$

Suppose all the pivot rows of the current tableau are lexicographically positive. The tableau-simplex algorithm differs from the simplex algorithm only in the way the exit index is determined. The following procedure is a generalization of Step 4 in the simplex algorithm for finding the exit index.

Step 4'. Let $u^l := (u^l_1, \dots, u^l_M)^\top := A_{B^l}^{-1} A_{j^l}$. For each $i \in \mathcal{M}$ with $u^l_i > 0$, divide the $i$th pivot row (including the entry in the zeroth column) by $u^l_i$. Among those rows, find the index, $i^l_*$, of the lexicographically smallest row. Then, $i^l := B^l_{i^l_*}$ is the exit index.

Remark. Since $u^l = -d^l_{B^l}$, if $i^l$ in (12) is unique with $z^l_{i^l} > 0$, then it is the same as the lexicographically smallest row that the tableau-simplex algorithm seeks. Hence the two algorithms coincide. The simplex algorithm determines the exit index based only on the zeroth column in the tableau, while the lexicographic ordering additionally involves the pivot columns.

The optimality of $B^l$ for $\lambda \in [\lambda_l, \lambda_{l-1}]$ immediately follows by the same Step 3, and (13) remains true for the exit index $i^l$ of the tableau-simplex algorithm, which implies the optimality of $B^{l+1}$ at $\lambda = \lambda_l$. Some characteristics of the updated tableau associated with $B^{l+1}$ are described in the next theorem. The proof is adapted from that for the lexicographic pivoting rule in Bertsimas and Tsitsiklis (1997), p. 108. See Appendix D for details.

THEOREM 10. For the updated basic index set $B^{l+1}$ by the tableau-simplex algorithm,
i) all the pivot rows of the updated tableau are still lexicographically positive, and
ii) the updated cost row is lexicographically greater than that for $B^l$.

Since $A_{B^{l+1}}^{-1} b$ is the zeroth column of the pivot rows, i) says that the basic solution for $B^{l+1}$ is feasible, i.e., $z^{l+1} \ge 0$. Moreover, it implies that the updating procedure can be repeated with $B^{l+1}$ and the new tableau.
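The lexicographic exit rule of the tableau-simplex algorithm can be sketched as follows. The function name `lex_exit_row` is hypothetical; it takes the pivot rows (each starting with its zeroth-column entry) and the vector $u^l$, and returns the 0-based position of the lexicographically smallest scaled row. Mapping that position back to the exit index through the basis is left to the caller.

```python
from fractions import Fraction


def lex_exit_row(pivot_rows, u):
    """Position of the lexicographically smallest pivot row after division by u_i > 0."""
    scaled = {
        i: [Fraction(x) / u[i] for x in row]
        for i, row in enumerate(pivot_rows)
        if u[i] > 0
    }
    if not scaled:
        raise ValueError("no row with u_i > 0: no exit index exists")
    # Python compares lists lexicographically, which matches Definition 9.
    return min(scaled, key=lambda i: scaled[i])


# Degenerate example: rows 0 and 1 tie in the zeroth column (both 0),
# so the plain ratio test cannot separate them; the pivot columns break the tie.
rows = [[0, 1, 0, 2], [0, 0, 1, 1], [3, 0, 0, 1]]
print(lex_exit_row(rows, [1, 1, -1]))
```

When the minimum ratio in the zeroth column is unique, this rule picks the same row as the ordinary ratio test, in line with the Remark above.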
It is not hard to see that $z^{l+1} = z^l$ if and only if $z^l_{i^l} = 0$ (see the proof of Theorem 10 in the Appendix for more details). When $z^l_{i^l} = 0$, $z^{l+1} = z^l$; however, the tableau-simplex algorithm uniquely updates $B^{l+1}$ such that the previous optimal basic index sets $B^l$'s never reappear in the process. This anti-cycling property is guaranteed by ii). By ii), we can strictly order the optimal basic index sets $B^l$ based on their cost rows. Because of this and the fact that the number of possible basic index sets is finite, the total number of iterations must be finite. This proves the following.

COROLLARY 11. The tableau updating procedure terminates after a finite number of iterations.

Suppose that the tableau-simplex algorithm stops after $J$ iterations with $\lambda_J \le 0$. In parallel to the simplex algorithm, the tableau-simplex algorithm outputs the sequence $\{(z^l, s_l, \lambda_l) : l = 0, \dots, J\}$, and the solution paths for (4) and (7) admit the same forms as in Theorem 6 and Theorem 7 except for any duplicate joints $\lambda_l$ and $s_l$.

4. Examples of LP for Regularization

A few concrete examples of LP problems that arise in statistics for feature selection via regularization are given. For each example, the elements in the standard LP form are identified, and their commonalities and how they can be utilized for efficient computation are discussed.

4.1. $\ell_1$-Norm Quantile Regression

Quantile regression is a regression technique, introduced by Koenker and Bassett (1978), intended to estimate the conditional quantile functions. It is obtained by replacing the squared error loss of the classical linear regression for the conditional mean function with a piecewise linear loss called the check function. For a general introduction to quantile regression, see Koenker and Hallock (2001).

For simplicity, assume that the conditional quantiles are linear in the predictors. Given a data set, $\{(x_i, y_i) : x_i \in \mathbb{R}^p,\ y_i \in \mathbb{R},\ i = 1, \dots, n\}$, the $\tau$th conditional quantile function is estimated by

$$\min_{\beta_0 \in \mathbb{R},\ \beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau(y_i - \beta_0 - x_i \beta), \tag{15}$$

where $\beta_0$ and $\beta := (\beta_1, \dots, \beta_p)^\top$ are the quantile regression coefficients for $\tau \in (0, 1)$, and $\rho_\tau(\cdot)$ is the check function defined as

$$\rho_\tau(t) := \begin{cases} \tau \cdot t & \text{for } t > 0 \\ -(1 - \tau) \cdot t & \text{for } t \le 0. \end{cases}$$

For example, when $\tau = 1/2$, it looks for the median regression function. The standard quantile regression problem in (15) can be cast as an LP problem itself, and enumeration of the entire range of quantile functions parametrized by $\tau$ is feasible, as noted in Koenker (2005), p. 85. Since it is somewhat different from the array of statistical optimization problems for feature selection that we intend to address in this paper, we leave an adequate treatment of this topic elsewhere and turn to a regularized quantile regression.

Aiming at estimating the conditional quantile function simultaneously with selecting relevant predictors, Li and Zhu (2005) propose the $\ell_1$-norm quantile regression. It is defined by the following constrained optimization problem:

$$\min_{\beta_0 \in \mathbb{R},\ \beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau(y_i - \beta_0 - x_i \beta) \quad \text{s.t.} \quad \|\beta\|_1 \le s,$$

where $s > 0$ is a regularization parameter. Equivalently, with $\lambda_Q = (1 - \tau)/\tau$ (the weight obtained by dividing the check loss by $\tau$) and $\lambda$ another tuning parameter, the $\ell_1$-norm quantile regression can be recast as

$$\min_{\beta_0 \in \mathbb{R},\ \beta \in \mathbb{R}^p,\ \zeta \in \mathbb{R}^n} \sum_{i=1}^{n} \left\{(\zeta_i)_+ + \lambda_Q (\zeta_i)_-\right\} + \lambda \|\beta\|_1 \quad \text{s.t.} \quad \beta_0 + x_i \beta + \zeta_i = y_i \ \text{ for } i = 1, \dots, n, \tag{16}$$

where $(x)_+ = \max(x, 0)$ and $(x)_- = \max(-x, 0)$.
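The check function and the objective in (15) are easy to evaluate directly. The sketch below uses hypothetical function names and a tiny made-up data set; it also checks the decomposition $\rho_\tau(t) = \tau (t)_+ + (1 - \tau)(t)_-$, which is what allows the loss to be written with the positive and negative parts of the residuals.

```python
from fractions import Fraction


def check_loss(t, tau):
    """Check function: tau * t for t > 0 and -(1 - tau) * t otherwise."""
    return tau * t if t > 0 else -(1 - tau) * t


def quantile_objective(y, X, beta0, beta, tau):
    """Sum of check losses of the residuals y_i - beta0 - x_i beta, as in (15)."""
    total = 0
    for yi, xi in zip(y, X):
        fitted = beta0 + sum(xij * bj for xij, bj in zip(xi, beta))
        total += check_loss(yi - fitted, tau)
    return total


tau = Fraction(1, 4)
for t in (Fraction(3), Fraction(-2), Fraction(0)):
    pos, neg = max(t, 0), max(-t, 0)
    # The check loss splits into weighted positive and negative parts.
    assert check_loss(t, tau) == tau * pos + (1 - tau) * neg

# Made-up data: three observations, one predictor.
y = [Fraction(1), Fraction(2), Fraction(4)]
X = [[Fraction(0)], [Fraction(1)], [Fraction(2)]]
print(quantile_objective(y, X, Fraction(1), [Fraction(1)], tau))
```

With $\tau = 1/2$ the objective is half the sum of absolute residuals, which corresponds to median regression.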
(16) can be formulated as an LP parametrized by $\lambda$, which is a common feature of the examples discussed. For the non-negativity constraint in the standard form of LP, consider both positive and negative parts of each variable and denote, for example, $((\beta_1)_+, \dots, (\beta_p)_+)^\top$ by $\beta^+$ and $((\beta_1)_-, \dots, (\beta_p)_-)^\top$ by $\beta^-$. Note that $\beta = \beta^+ - \beta^-$ and the $\ell_1$-norm $\|\beta\|_1 := \sum_{i=1}^{p} |\beta_i|$ is given by $\mathbf{1}^\top (\beta^+ + \beta^-)$ with $\mathbf{1} := (1, \dots, 1)^\top$ of appropriate length. Let $Y := (y_1, \dots, y_n)^\top$, $X := (x_1^\top, \dots, x_n^\top)^\top$, $\zeta := (\zeta_1, \dots, \zeta_n)^\top$, and $\mathbf{0} := (0, \dots, 0)^\top$ of appropriate length. Then the following elements define the $\ell_1$-norm quantile regression in the standard form of a parametric-cost LP in (4):

$$\begin{aligned} z &:= (\beta_0^+,\ \beta_0^-,\ (\beta^+)^\top,\ (\beta^-)^\top,\ (\zeta^+)^\top,\ (\zeta^-)^\top)^\top \\ c &:= (0,\ 0,\ \mathbf{0}^\top,\ \mathbf{0}^\top,\ \mathbf{1}^\top,\ \lambda_Q \mathbf{1}^\top)^\top \\ a &:= (0,\ 0,\ \mathbf{1}^\top,\ \mathbf{1}^\top,\ \mathbf{0}^\top,\ \mathbf{0}^\top)^\top \\ A &:= (\mathbf{1},\ -\mathbf{1},\ X,\ -X,\ I,\ -I) \\ b &:= Y \end{aligned}$$

with a total of $N = 2(1 + p + n)$ variables and $M = n$ equality constraints.

4.2. $\ell_1$-Norm SVM

Consider a binary classification problem where $y_i \in \{-1, 1\}$, $i = 1, \dots, n$, denote the class labels. The Support Vector Machine (SVM) introduced by Cortes and Vapnik (1995) is a classification method that finds the optimal hyperplane maximizing the margin between the classes. It is another example of a regularization method, with a margin-based hinge loss and the ridge regression type $\ell_2$ norm penalty. The optimal hyperplane ($\beta_0 + x\beta = 0$) in the standard SVM is determined by the solution to the problem:

$$\min_{\beta_0 \in \mathbb{R},\ \beta \in \mathbb{R}^p} \sum_{i=1}^{n} \{1 - y_i(\beta_0 + x_i \beta)\}_+ + \lambda \|\beta\|_2^2,$$

where $\lambda > 0$ is a tuning parameter. Replacing the $\ell_2$ norm with the $\ell_1$ norm for selection of variables as in Bradley and Mangasarian (1998) and Zhu et al. (2004), we arrive at a variant of the soft-margin SVM:

$$\min_{\beta_0 \in \mathbb{R},\ \beta \in \mathbb{R}^p,\ \zeta \in \mathbb{R}^n} \sum_{i=1}^{n} (\zeta_i)_+ + \lambda \|\beta\|_1 \quad \text{s.t.} \quad y_i(\beta_0 + x_i \beta) + \zeta_i = 1 \ \text{ for } i = 1, \dots, n. \tag{17}$$

Similarly, this $\ell_1$-norm SVM can be formulated as a parametric cost LP with the following elements in the standard form:

$$\begin{aligned} z &:= (\beta_0^+,\ \beta_0^-,\ (\beta^+)^\top,\ (\beta^-)^\top,\ (\zeta^+)^\top,\ (\zeta^-)^\top)^\top \\ c &:= (0,\ 0,\ \mathbf{0}^\top,\ \mathbf{0}^\top,\ \mathbf{1}^\top,\ \mathbf{0}^\top)^\top \\ a &:= (0,\ 0,\ \mathbf{1}^\top,\ \mathbf{1}^\top,\ \mathbf{0}^\top,\ \mathbf{0}^\top)^\top \\ A &:= (Y,\ -Y,\ \mathrm{diag}(Y) X,\ -\mathrm{diag}(Y) X,\ I,\ -I) \\ b &:= \mathbf{1}. \end{aligned}$$

This example will be revisited in great detail in Section 5.

4.3. $\ell_1$-Norm Functional Component Selection

We have considered only linear functions in the original variables for conditional quantiles and separating hyperplanes so far. In general, the technique of $\ell_1$ norm regularization for variable selection can be extended to nonparametric regression and classification. Although many different extensions are possible, we discuss here a specific extension for feature selection which is well suited to a wide range of function estimation and prediction problems. In a nutshell, the space of linear functions is substituted with a rich function space such as a reproducing kernel Hilbert space (Wahba, 1990; Schölkopf and Smola, 2002) where functions are decomposed into interpretable functional components, and the decomposition corresponds to a set of different kernels which generate the functional subspaces. Let an ANOVA-like decomposition of $f$ with, say, $d$ components be $f = f_1 + \dots + f_d$, and let $K_\nu$, $\nu = 1, \dots, d$, be the associated kernels. Non-negative weights $\theta_\nu$ are then introduced for recalibration of the functional components $f_\nu$. Treating the $f_\nu$'s as features and restricting the $\ell_1$ norm of $\theta := (\theta_1, \dots, \theta_d)^\top$ akin to the LASSO leads to a general procedure for feature selection and shrinkage.
Detailed discussions of the idea can be found in Lin and Zhang (2006); Gunn and Kandola (2002); Zhang (2006); Lee et al. (2006). More generally, Micchelli and Pontil (2005) treat it as a regularization procedure for optimal kernel combination.

For illustration, we consider the $\theta$-step of the structured SVM in Lee et al. (2006), which yields another parametric cost LP problem. For generality, consider a $k$-category problem with potentially different misclassification costs. The class labels are coded by $k$-vectors; $y_i = (y_i^1, \dots, y_i^k)^\top$ denotes a vector with $y_i^j = 1$ and $-1/(k - 1)$ elsewhere if the $i$th observation falls into class $j$. $L(y_i) = (L^1_{y_i}, \dots, L^k_{y_i})$ is a misclassification cost vector, where $L^{j'}_j$ is the cost of misclassifying $j$ as $j'$. The SVM aims to find $f = (f^1, \dots, f^k)^\top$ closely matching an appropriate class code $y$ given $x$, which induces a classifier $\phi(x) = \arg\max_{j = 1, \dots, k} f^j(x)$. Suppose that each $f^j$ is of the form

$$\beta_0^j + h^j(x) := \beta_0^j + \sum_{i=1}^{n} \beta_i^j \sum_{\nu=1}^{d} \theta_\nu K_\nu(x_i, x).$$

Define the squared norm of $h^j$ as $\|h^j\|_K^2 := (\beta^j)^\top \left(\sum_{\nu=1}^{d} \theta_\nu K_\nu\right) \beta^j$, where $\beta^j := (\beta_1^j, \dots, \beta_n^j)^\top$ is the $j$th coefficient vector, and $K_\nu$ is the $n$ by $n$ kernel matrix associated with $K_\nu$. With the extended hinge loss $L\{y_i, f(x_i)\} := L(y_i)\{f(x_i) - y_i\}_+$, the structured SVM finds $f$ with $\beta$ and $\theta$ minimizing

$$\sum_{i=1}^{n} L(y_i)\{f(x_i) - y_i\}_+ + \frac{\lambda}{2} \sum_{j=1}^{k} \|h^j\|_K^2 + \lambda_\theta \sum_{\nu=1}^{d} \theta_\nu \tag{18}$$

subject to $\theta_\nu \ge 0$ for $\nu = 1, \dots, d$. $\lambda$ and $\lambda_\theta$ are tuning parameters. By alternating estimation of $\beta$ and $\theta$, we attempt to find the optimal kernel configuration (a linear combination of pre-specified kernels) and the coefficients associated with the optimal kernel. The $\theta$-step refers to optimization of the functional component weights $\theta$ given $\beta$. More specifically, treating $\beta$ as fixed, the weights of the features are chosen to minimize

$$\sum_{j=1}^{k} (L^j)^\top \left\{\beta_0^j \mathbf{1} + \sum_{\nu=1}^{d} \theta_\nu K_\nu \beta^j - y^j\right\}_+ + \frac{\lambda}{2} \sum_{j=1}^{k} (\beta^j)^\top \left(\sum_{\nu=1}^{d} \theta_\nu K_\nu\right) \beta^j + \lambda_\theta \sum_{\nu=1}^{d} \theta_\nu,$$

where $L^j := (L^j_{y_1}, \dots, L^j_{y_n})^\top$ and $y^j = (y_1^j, \dots, y_n^j)^\top$. This optimization problem can be rephrased as

$$\begin{aligned} \min_{\zeta \in \mathbb{R}^{nk},\ \theta \in \mathbb{R}^d} \quad & \sum_{j=1}^{k} (L^j)^\top (\zeta^j)_+ + \frac{\lambda}{2} \sum_{\nu=1}^{d} \theta_\nu \left(\sum_{j=1}^{k} (\beta^j)^\top K_\nu \beta^j\right) + \lambda_\theta \sum_{\nu=1}^{d} \theta_\nu \\ \text{s.t.} \quad & \sum_{\nu=1}^{d} \theta_\nu K_\nu \beta^j - \zeta^j = y^j - \beta_0^j \mathbf{1} \ \text{ for } j = 1, \dots, k \\ & \theta_\nu \ge 0 \ \text{ for } \nu = 1, \dots, d. \end{aligned}$$

Let $g_\nu := (\lambda/2) \sum_{j=1}^{k} (\beta^j)^\top K_\nu \beta^j$, $g := (g_1, \dots, g_d)^\top$, $L := ((L^1)^\top, \dots, (L^k)^\top)^\top$, and $\zeta := ((\zeta^1)^\top, \dots, (\zeta^k)^\top)^\top$. Also, let

$$X := \begin{bmatrix} K_1 \beta^1 & \cdots & K_d \beta^1 \\ \vdots & \ddots & \vdots \\ K_1 \beta^k & \cdots & K_d \beta^k \end{bmatrix}.$$

Then the following elements define the $\theta$-step as a parametric cost LP indexed by $\lambda_\theta$ with $N = d + 2nk$ variables and $M = nk$ equality constraints:

$$\begin{aligned} z &:= (\theta^\top,\ (\zeta^+)^\top,\ (\zeta^-)^\top)^\top \\ c &:= (g^\top,\ L^\top,\ \mathbf{0}^\top)^\top \\ a &:= (\mathbf{1}^\top,\ \mathbf{0}^\top,\ \mathbf{0}^\top)^\top \\ A &:= (X,\ -I,\ I) \\ b &:= ((y^1 - \beta_0^1 \mathbf{1})^\top, \dots, (y^k - \beta_0^k \mathbf{1})^\top)^\top. \end{aligned}$$

4.4. Computation

The LP problems for the foregoing three examples share a similar structure that can be exploited in computation. First of all, the $A$ matrix has both $I$ and $-I$ as its sub-matrices, and the entries of the penalty coefficient vector $a$ corresponding to $I$ and $-I$ in $A$ are zero. Thus, the ranks of $A$ and $A_{I_a}$ are $M$, and the initial optimal solution exists and can be easily identified. Due to the special structure of $A_{I_a}$, it is easy to find a basic index set $B^* \subset I_a$ for the initial LP problem in (9) which gives a feasible solution. For instance, a feasible basic solution can be obtained by constructing a basic index set $B^*$ such that for $b_j \ge 0$, we choose the $j$th index from those for $I$, and otherwise from the indices for $-I$. For the $\theta$-step of structured SVM, $B^*$ itself is the initial optimal basic index set, and it gives a trivial initial solution. For the $\ell_1$-norm SVM and $\ell_1$-norm quantile regression, the basic index set defined above is not optimal. However, the initial optimal basic index set can be obtained easily from $B^*$.
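The construction of the feasible starting basis from the $I$ and $-I$ blocks can be sketched as below. The function name `initial_feasible_basis` and its offset arguments are hypothetical: `offset_pos` and `offset_neg` are assumed to be the 0-based positions of the first columns of the $I$ and $-I$ blocks in $A$, which depend on how the variables are ordered.

```python
def initial_feasible_basis(b, offset_pos, offset_neg):
    """For each row j pick the column of I if b_j >= 0, else the column of -I.

    The resulting basic matrix is diagonal with entries +1 or -1, so the basic
    solution is |b_j| in row j and is therefore feasible.
    """
    basis = [offset_pos + j if bj >= 0 else offset_neg + j for j, bj in enumerate(b)]
    values = [abs(bj) for bj in b]
    return basis, values


# Layout assumed here: A = (1, -1, X, -X, I, -I) with p = 1 predictor and n = 3 cases,
# so the I block starts at column 2 + 2p = 4 and the -I block at column 4 + n = 7.
basis, values = initial_feasible_basis([2, -1, 0], offset_pos=4, offset_neg=7)
print(basis, values)
```

For the $\theta$-step layout $A = (X, -I, I)$ the same function applies with the two offsets swapped accordingly.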
In general, the tableau-simplex algorithm in Section 3 can be used to find the optimal basic index set of a standard LP problem, taking any feasible basic index set $\mathcal{B}$ as a starting point. The necessary modification of the algorithm for standard LP problems is that the entry index $j^\ell \in \mathcal{N}$ is chosen to satisfy $\check{c}_{j^\ell} < 0$ at Step 3. For $\mathcal{B}$, all but the indices $j$ for $\beta_0^+$ and $\beta_0^-$ satisfy $\check{c}_j \ge 0$. Therefore, one of the indices for $\beta_0$ will move into the basic index set first by the algorithm, and it may take some iterations to get the initial optimal index set for the two regularization problems.

A tableau contains all the information on the current LP solution and the terms necessary for the next update. To discuss the computational complexity of the tableau updating algorithm in Section 3.2.2, let $T$ denote the tableau, an $(N + 1) \times (M + 2)$ matrix associated with the current optimal basic index set $\mathcal{B}$. For a compact statement of the updating formula, assume that the tableau is rearranged such that the pivot columns and the pivot rows precede the zeroth column and the cost row and the penalty row, respectively. For the entry index $j^\ell$ and exit index $i^\ell$ defined in the algorithm, $T_{\cdot j^\ell}$ denotes its $j^\ell$th column vector, $T_{i^\ell \cdot}$ the $i^\ell$th row vector of $T$, and $T_{i^\ell j^\ell}$ the $(i^\ell, j^\ell)$th entry of $T$. The proof of Theorem 10 in Appendix D implies the following updating formula:

$$T^{+} = T - \frac{1}{T_{i^\ell j^\ell}} \big(T_{\cdot j^\ell} - e_{i^\ell}\big)\, T_{i^\ell \cdot}. \qquad (19)$$
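The updating formula is a rank-one (Gauss-Jordan) pivot, and a small sketch may make its cost visible. The code below is an illustration, not the authors' implementation: the function name `pivot_update` and the toy tableau are assumptions, and exact fractions are used only so that the result is easy to check by hand.

```python
from fractions import Fraction


def pivot_update(T, i_star, j_star):
    """Return T+ = T - (T[:, j*] - e_{i*}) T[i*, :] / T[i*, j*].

    T is a list of rows; every entry is visited once, so the cost is
    proportional to (number of rows) x (number of columns).
    """
    pivot = T[i_star][j_star]
    if pivot == 0:
        raise ValueError("pivot entry must be non-zero")
    pivot_row = T[i_star]
    updated = []
    for i, row in enumerate(T):
        # ith entry of (T[:, j*] - e_{i*}), divided by the pivot
        factor = (row[j_star] - (1 if i == i_star else 0)) / pivot
        updated.append([v - factor * p for v, p in zip(row, pivot_row)])
    return updated


# Toy tableau (made up): two constraint rows followed by one cost row.
T = [[Fraction(v) for v in row] for row in
     [[2, 1, 1, 0, 4],
      [1, 3, 0, 1, 6],
      [-3, -2, 0, 0, 0]]]
T_plus = pivot_update(T, i_star=0, j_star=0)
pivot_column = [row[0] for row in T_plus]
```

After the update the pivot column becomes the unit vector $e_{i^\ell}$, and the pivot row is the old pivot row divided by the pivot entry, which is the usual simplex pivot.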

Therefore, the computational complexity of the tableau updating is approximately $O(MN)$ for each iteration in general. For the three examples, tableau update can be further streamlined. Exploiting the structure of $A$ with paired columns and fixed elements in the tableau associated with $\mathcal{B}$, we can compress each tableau, retaining the information about the current tableau, and update the reduced tableau instead. We leave discussion of implementation details elsewhere, but updating such a reduced tableau has the complexity of $O((N_g - M)M)$ for each iteration, where $N_g$ is the reduced number of columns in $A$ counting only one for each of the paired columns. As a result, when the tableau algorithm stops in $J$ iterations, the complexity of both $\ell_1$-norm SVM and $\ell_1$-norm QR as a whole is $O((p + 1)nJ)$ while that of the θ-step of structured SVM is roughly $O(dnkJ)$, where $p$ is the number of variables, $d$ is the number of kernel functions, and $k$ is the number of classes.

5. A Closer Look at the $\ell_1$-Norm Support Vector Machine

Taking the $\ell_1$-norm SVM as a case in point, we describe the implications of the tableau-simplex algorithm for generating the solution path. Zhu et al. (2004) provide a specific path-finding algorithm for the $\ell_1$-norm SVM in the complexity-bounded formulation of (7) and give a careful treatment of this particular problem. We discuss the correspondence and generality of the tableau-simplex algorithm in comparison with their algorithm.

5.1. Status Sets

For the SVM problem with the complexity bound $s$, i.e. $\|\beta\|_1 \le s$, let $\beta_0(s)$ and $\beta(s) := (\beta_1(s), \dots, \beta_p(s))$ be the optimal solution at $s$. Zhu et al. (2004) categorize the variables and cases that are involved in the regularized LP problem as follows:

Active set: $\mathcal{A}(s) := \{j : \beta_j(s) \ne 0,\ j = 0, 1, \dots, p\}$
Elbow set: $\mathcal{E}(s) := \{i : y_i\{\beta_0(s) + x_i \beta(s)\} = 1,\ i = 1, \dots, n\}$
Left set: $\mathcal{L}(s) := \{i : y_i\{\beta_0(s) + x_i \beta(s)\} < 1,\ i = 1, \dots, n\}$
Right set: $\mathcal{R}(s) := \{i : y_i\{\beta_0(s) + x_i \beta(s)\} > 1,\ i = 1, \dots, n\}$.
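As a concrete reading of these definitions, the sketch below classifies variables and cases for a given coefficient vector. It is a minimal illustration with made-up data; the function name `status_sets` and the numerical tolerance `tol` are assumptions introduced here (exact equality to 1 is rarely observed in floating point).

```python
def status_sets(beta0, beta, X, y, tol=1e-8):
    """Return (active, elbow, left, right) as sets of indices.

    Index 0 denotes the intercept, variables are numbered 1..p and
    cases 1..n, following the definitions in the text.
    """
    active = {0} if abs(beta0) > tol else set()
    active |= {j for j, bj in enumerate(beta, start=1) if abs(bj) > tol}

    elbow, left, right = set(), set(), set()
    for i, (xi, yi) in enumerate(zip(X, y), start=1):
        margin = yi * (beta0 + sum(v * bj for v, bj in zip(xi, beta)))
        if abs(margin - 1.0) <= tol:
            elbow.add(i)      # y_i f(x_i) = 1
        elif margin < 1.0:
            left.add(i)       # y_i f(x_i) < 1
        else:
            right.add(i)      # y_i f(x_i) > 1
    return active, elbow, left, right


# Toy data (made up for illustration): n = 4 cases, p = 2 variables.
X = [[1.0, 0.0], [-1.0, 0.5], [3.0, 1.0], [0.2, -1.0]]
y = [1, -1, 1, -1]
active, elbow, left, right = status_sets(0.0, [1.0, 0.0], X, y)
```

The elbow, left, and right sets always partition the cases, whatever coefficients are supplied.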
Now, consider the solution $z(s)$ given by the tableau-simplex algorithm as defined in Section 4.2 and the equality constraints of $Az(s) = b$, that is,

$$Az(s) := \beta_0(s)\, y + \mathrm{diag}(y)\, X \beta(s) + \zeta(s) = \mathbf{1}.$$

It is easy to see that for any solution $z(s)$, its non-zero elements must be one of the following types, and hence associated with $\mathcal{A}(s)$, $\mathcal{L}(s)$, and $\mathcal{R}(s)$:

$\beta_j^+(s) > 0$ or $\beta_j^-(s) > 0$ (but not both) $\Rightarrow j \in \mathcal{A}(s)$;
$\zeta_i^+(s) > 0$ and $\zeta_i^-(s) = 0$ $\Rightarrow i \in \mathcal{L}(s)$;
$\zeta_i^+(s) = 0$ and $\zeta_i^-(s) > 0$ $\Rightarrow i \in \mathcal{R}(s)$.

On the other hand, if $\zeta_i^+(s) = 0$ and $\zeta_i^-(s) = 0$, then $i \in \mathcal{E}(s)$, the elbow set.

Assumption

Suppose that the $\ell$th joint solution at $s = s_\ell$ is non-degenerate. Then $z_j(s_\ell) > 0$ if and only if $j \in \mathcal{B}^\ell$. This gives $|\mathcal{A}(s_\ell)| + |\mathcal{L}(s_\ell)| + |\mathcal{R}(s_\ell)| = n$. Since $\mathcal{E}(s) \cup \mathcal{L}(s) \cup \mathcal{R}(s) = \{1, \dots, n\}$ for any $s$, the relationship that $|\mathcal{A}(s_\ell)| = |\mathcal{E}(s_\ell)|$ must hold for all the joint solutions. In fact, the equality of the cardinality of the active set and the elbow set is stated as an assumption for uniqueness of the solution in the algorithm of Zhu et al. (2004). The implicit assumption of $z_{\mathcal{B}^\ell} > 0$ at each joint implies $z^{\ell+1} \ne z^\ell$, the non-degeneracy assumption for the simplex algorithm. Thus the simplex algorithm is less restrictive. In practice, the assumption that joint solutions are non-degenerate may not hold, especially when important predictors are discrete or coded categorical variables such as gender. For instance, the initial solution of the $\ell_1$-norm SVM violates the assumption in most cases, requiring a separate treatment for finding the next joint solution after initialization. In general, there could be more than one degenerate joint solution along the solution path. This would make the tableau-simplex algorithm appealing as it does not rely on any restrictive assumption.

Duality in Algorithm

To move from one joint solution to the next, the simplex algorithm finds the entry index $j^\ell$. For the $\ell_1$-norm SVM, each index is associated with either $\beta_j$ or $\zeta_i$. Under the non-degeneracy assumption, the variable associated with $j^\ell$ must change from zero to non-zero after the joint ($s > s_\ell$). Therefore, only one of the following events as defined in Zhu et al. (2004) can happen immediately after a joint solution:

$\beta_j(s_\ell) = 0$ becomes $\beta_j(s) \ne 0$, i.e., an inactive variable becomes active;
$\zeta_i(s_\ell) = 0$ becomes $\zeta_i(s) \ne 0$, i.e., an element leaves the elbow set and joins either the left set or the right set.

In conjunction with the entry index, the simplex algorithm determines the leaving index, which accompanies one of the reverse events. The algorithm in Zhu et al. (2004), driven by the Karush-Kuhn-Tucker optimality conditions, seeks the event with the smallest $\Delta\mathrm{loss}/\Delta s$, in other words, the one that decreases the cost with the fastest rate. The simplex algorithm is consistent with this existing algorithm. As in (10), recall that the entry index $j^\ell$ is chosen to minimize $\check{c}_j/\check{a}_j$ among $j \in \mathcal{N} \setminus \mathcal{B}^\ell$ with $\check{a}_j > 0$. $\mathcal{N} \setminus \mathcal{B}^\ell$ contains those indices corresponding to $j \notin \mathcal{A}(s_\ell)$ or $i \in \mathcal{E}(s_\ell)$. Analogous to the optimal moving direction $d^\ell$ in (11), define $v^j = (v^j_1, \dots, v^j_N)^\top$ such that $v^j_{\mathcal{B}^\ell} = -A_{\mathcal{B}^\ell}^{-1} A_j$, $v^j_j = 1$, and $v^j_i = 0$ for $i \in \mathcal{N} \setminus (\mathcal{B}^\ell \cup \{j\})$. Then

$$\check{a}_j := a_j - a_{\mathcal{B}^\ell}^\top A_{\mathcal{B}^\ell}^{-1} A_j = a^\top v^j \propto \Delta s_j \quad \text{and} \quad \check{c}_j := c_j - c_{\mathcal{B}^\ell}^\top A_{\mathcal{B}^\ell}^{-1} A_j = c^\top v^j \propto \Delta \mathrm{loss}_j.$$
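The entry-index rule can be traced numerically on a tiny problem. The sketch below is a hedged illustration: the LP data, the zero-based indexing, and the helper names `solve` and `entry_index` are assumptions made here, and a dense exact solve stands in for whatever factorization a real implementation would maintain.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with exact fractions.

    Assumes M is square and non-singular (no singularity handling).
    """
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def entry_index(A, a, c, basis):
    """Return (argmin of c_red/a_red over non-basic j with a_red > 0, ratios)."""
    n_vars = len(a)
    A_B = [[row[k] for k in basis] for row in A]
    ratios = {}
    for j in range(n_vars):
        if j in basis:
            continue
        w = solve(A_B, [row[j] for row in A])      # A_B^{-1} A_j
        v = [Fraction(0)] * n_vars                 # direction v^j
        for k, wk in zip(basis, w):
            v[k] = -wk
        v[j] = Fraction(1)
        a_red = sum(ai * vi for ai, vi in zip(a, v))   # a' v^j
        c_red = sum(ci * vi for ci, vi in zip(c, v))   # c' v^j
        if a_red > 0:
            ratios[j] = c_red / a_red
    best = min(ratios, key=ratios.get) if ratios else None
    return best, ratios


# Toy LP (made up): two penalized columns followed by two slack columns.
A = [[1, 2, 1, 0],
     [1, -1, 0, 1]]
a = [1, 1, 0, 0]   # penalty coefficients
c = [0, 0, 1, 1]   # cost (loss) coefficients
j_entry, ratios = entry_index(A, a, c, basis=[2, 3])
```

In this toy problem both non-basic columns increase the penalty at unit rate, and the column that reduces the cost faster per unit of penalty is the one selected.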
Thus, the index chosen by the simplex algorithm in (10) maximizes the rate of reduction in the cost, $\Delta\mathrm{loss}/\Delta s$. The existing $\ell_1$-norm SVM path algorithm needs to solve roughly $p$ groups of $|\mathcal{E}|$-variate linear equation systems for each iteration. Its computational complexity can be $O(p|\mathcal{E}|^2 + p|\mathcal{L}|)$ if the Sherman-Morrison updating formula is used. On the other hand, the computational complexity of the tableau-simplex algorithm is $O(pn)$ for each iteration as mentioned in Section 4. Therefore, the former could be faster if $n/p$ is large; otherwise, the tableau-simplex algorithm is faster. Most of the arguments in this section also apply to the comparison of the simplex algorithms with the extended solution path algorithm for the $\ell_1$-norm multi-class SVM by Wang and Shen (2006).

6. Numerical Results

We illustrate the use of the tableau-simplex algorithm for the parametric LP in statistical applications with two simulated examples and analysis of real data, and discuss model selection or variable selection problems therein.

6.1. Classification

Consider a simple binary classification problem with linear classifiers. In this simulation, 10-dimensional independent covariates are generated from the standard normal distribution, $x := (x_1, \dots, x_{10}) \sim N(0, I)$, and the response variable is generated via the following probit model:

$$Y = \mathrm{sign}(\beta_0 + x\beta + \epsilon), \qquad (20)$$

where $\epsilon \sim N(0, \sigma^2)$, and $x$ and $\epsilon$ are assumed to be mutually independent. Let $\phi(x) = \mathrm{sign}(\hat\beta_0 + x\hat\beta)$ denote a linear classifier with $\hat\beta_0$ and $\hat\beta$ estimated from data. Under the probit model, the theoretical error rate of $\phi(x)$ can be analytically obtained as follows. Given $\beta_0$ and $\beta$,

$$\Pr\{Y \ne \mathrm{sign}(\hat\beta_0 + X\hat\beta)\} = \Phi(-\hat u_0) + E\left\{ \Phi\left( -\frac{u_0 + \hat u^\top u\, Z}{\sqrt{(1 + 1/\mathrm{SNR}) - (\hat u^\top u)^2}} \right) \mathrm{sign}(Z + \hat u_0) \right\},$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution, $Z$ is a standard normal random variable, $u_0 := \beta_0/\|\beta\|_2$, $u := \beta/\|\beta\|_2$, $\hat u_0 := \hat\beta_0/\|\hat\beta\|_2$, and $\hat u := \hat\beta/\|\hat\beta\|_2$. The SNR refers to the signal-to-noise ratio defined as $\mathrm{var}(x\beta)/\sigma^2$ in this case. Note that the error rate is invariant to scaling of $(\hat\beta_0, \hat\beta)$. Setting $\sigma^2 = 50$, $\beta_0 = 0$, and $\beta = (2, 0, 2, 0, 2, 0, 0, 0, 0, 2)^\top$, we have $\mathrm{var}(x\beta) = \|\beta\|_2^2 = 16$ and hence the SNR of 0.32. Then, for the Bayes decision rule, in particular, we have the error rate of

$$\Pr\{Y \ne \mathrm{sign}(\beta_0 + X\beta)\} = \frac{1}{2} - \frac{1}{\pi} \arctan\sqrt{\mathrm{SNR}} \approx 0.336, \qquad (21)$$

which is the minimum possible value under the probit model (20).

Figure 1 shows the coefficient paths of the $\ell_1$-norm SVM indexed by $\log(\lambda)$ (piecewise constant) and $s$ (piecewise linear) for a simulated data set of size 400 from the model. Clearly, as $1/\lambda$ or $s$ increases, those estimated coefficients corresponding to the non-zero $\beta_j$'s ($j = 1, 3, 5$, and 10) grow large very quickly. The error rate associated with the solution at each point of the paths is theoretically available for this example, and thus the optimal value of the regularization parameter can be defined. However, in practice, $\lambda$ (or $s$) needs to be chosen data-dependently, and this gives rise to an important class of model selection problems in general. To assess the feasibility of a data-dependent choice of $\lambda$, we carried out cross validation and made a comparison with the theoretically optimal values. The dashed lines in Figure 1 indicate the optimal values of $\lambda$ (or $s$) chosen by five-fold cross validation with 0-1 loss (blue) and hinge loss (red), respectively. The discontinuity of the 0-1 loss tends to give jagged cross validation curves, which have an adverse effect on identification of the optimal value of the tuning parameter. To increase the stability, one may smooth out individual cross validated error rate curves by averaging them over different splits of the data. To that effect, cross validation was repeated 50 times with respect to the 0-1 loss and the hinge loss for averaging.
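The SNR and the Bayes error rate quoted for the classification example can be checked numerically. The sketch below is an illustration under assumptions: it evaluates the general error-rate expression in the form $\Phi(-\hat u_0) + E\{\Phi(-g(Z))\,\mathrm{sign}(Z + \hat u_0)\}$ with $g(Z) = (u_0 + \hat u^\top u Z)/\sqrt{1 + 1/\mathrm{SNR} - (\hat u^\top u)^2}$, and it replaces the expectation with a plain midpoint rule. The function name `error_rate`, the grid size, and the integration limits are choices made here, not taken from the paper.

```python
import math
from statistics import NormalDist

beta = [2, 0, 2, 0, 2, 0, 0, 0, 0, 2]
sigma2 = 50.0

# var(x beta) = ||beta||^2 because the covariates are independent N(0, 1).
snr = sum(b * b for b in beta) / sigma2
bayes_error = 0.5 - math.atan(math.sqrt(snr)) / math.pi


def error_rate(u0, u, uh0, uh, snr, n_grid=20000, lim=8.0):
    """Theoretical error rate of sign(uh0 + x uh) under the probit model.

    The expectation over Z ~ N(0, 1) is approximated by a midpoint rule
    on [-lim, lim]; this is a sketch, not an adaptive integrator.
    """
    nd = NormalDist()
    rho = sum(a * b for a, b in zip(uh, u))
    denom = math.sqrt(1.0 + 1.0 / snr - rho * rho)
    h = 2.0 * lim / n_grid
    total = 0.0
    for k in range(n_grid):
        z = -lim + (k + 0.5) * h
        sign = 1.0 if z + uh0 > 0 else -1.0
        total += nd.cdf(-(u0 + rho * z) / denom) * sign * nd.pdf(z) * h
    return nd.cdf(-uh0) + total


# Plugging in the true direction should reproduce the Bayes error rate.
norm = math.sqrt(sum(b * b for b in beta))
u = [b / norm for b in beta]
numeric_bayes = error_rate(0.0, u, 0.0, u, snr)
```

With the true direction plugged in for the classifier, the numerical value agrees with the closed form in (21), which gives some reassurance about the signs in the general expression.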
Fig. 1. The solution paths of the $\ell_1$-norm SVM for simulated data. The numbers at the end of the paths are the indices of the $\beta_j$'s. The values of $\log(\lambda)$ (or $s$) with the minimum five-fold cross validated error rate and hinge loss are indicated by the blue and red dashed lines, respectively.

Figure 2 displays the path of average misclassification rates from five-fold cross validation over the training data and the true error rate path for the $\ell_1$-norm SVM under the probit model. The true error rates were approximated by numerical integration up to the precision of $10^{-4}$. Selection of $s$ by cross validation with the 0-1 loss and hinge loss gave very similar results. The smallest error rate achieved by the $\ell_1$-norm SVM for this particular training data set is approximately 0.34, which is fairly close to the Bayes error rate. We observe that the linear classifiers at both of the chosen values include the four relevant predictors.

Fig. 2. Error rate paths, plotted against $s$. The left panel shows the path of average misclassification rates from five-fold cross validation of the training data repeated 50 times, and the right panel shows the true error rate path for the $\ell_1$-norm SVM under the probit model. The vertical dashed lines are the same as in Figure 1. The cross on the path in the right panel pinpoints the value of $s$ with the minimum error rate, and the gray horizontal dashed line indicates the Bayes error rate.

Quantile Regression

For another example, consider a quantile regression problem where covariates are simulated by the same setting as in the previous example, but a continuous response variable is defined by $Y = \beta_0 + x\beta + \epsilon$. Under the assumption that $\epsilon \sim N(0, \sigma^2)$, the theoretical $\tau$th conditional quantile function is given by $m_\tau(x) = \sigma\Phi^{-1}(\tau) + \beta_0 + x\beta$. Restricting to linear functions only, suppose that an estimated $\tau$th conditional quantile function is $f(x) = \hat\beta_0 + x\hat\beta$. With respect to the check function as a loss, one can calculate the risk of $f$, which is defined by

$$R(f; \beta_0, \beta) := E\big\{\tau (Y - \hat\beta_0 - X\hat\beta)_+ + (1 - \tau)(Y - \hat\beta_0 - X\hat\beta)_-\big\}$$

$$= \left\{\tau - \Phi\left(\frac{\hat\beta_0 - \beta_0}{\sqrt{\sigma^2 + \|\beta - \hat\beta\|_2^2}}\right)\right\}(\beta_0 - \hat\beta_0) + \sqrt{\frac{\sigma^2 + \|\beta - \hat\beta\|_2^2}{2\pi}}\, \exp\left\{-\frac{(\hat\beta_0 - \beta_0)^2}{2(\sigma^2 + \|\beta - \hat\beta\|_2^2)}\right\}.$$

For each $\tau$, the true risk of $m_\tau(x)$ is $(\sigma/\sqrt{2\pi}) \exp\{-\Phi^{-1}(\tau)^2/2\}$, which represents the minimal achievable risk. Note that the maximum of the minimal risks occurs when $\tau = 0.5$ in this case, i.e., for the median, and the true conditional median function is $m_{0.5}(x) = \beta_0 + x\beta$ with the risk of $\sigma/\sqrt{2\pi}$.

Figure 3 shows the coefficient paths of the $\ell_1$-norm median regression applied to simulated data of size 400. Similarly, Figure 4 shows the corresponding path of the averaged 10-fold cross validated risk with respect to the check loss from 10 repetitions and its corresponding theoretical risk path. At the value of $\lambda$ chosen by cross validation, the four correct predictors and one extra predictor have non-zero coefficients, and the theoretical risk of the selected model is not far from the minimal risk denoted by the horizontal reference line. We note that the complete risk path levels off roughly after $\log\lambda = 5$, implying that moderately regularized models are almost as good as the full model of the unconstrained solution.
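The closed-form risk under the check loss is easy to evaluate, and doing so gives a quick consistency check between the general expression and the minimal risk. The following is a sketch under the simulation assumptions of this section (standard normal covariates, normal errors); the function names `check_risk` and `minimal_risk` are introduced here for illustration only.

```python
import math
from statistics import NormalDist


def check_risk(tau, beta0_hat, beta_hat, beta0, beta, sigma):
    """Risk of f(x) = beta0_hat + x beta_hat under the check loss,
    assuming x ~ N(0, I) and errors ~ N(0, sigma^2)."""
    nd = NormalDist()
    mu = beta0 - beta0_hat
    s = math.sqrt(sigma ** 2 + sum((b - bh) ** 2 for b, bh in zip(beta, beta_hat)))
    return (tau - nd.cdf(-mu / s)) * mu + s * nd.pdf(mu / s)


def minimal_risk(tau, sigma):
    """Risk of the true tau-th conditional quantile function."""
    z = NormalDist().inv_cdf(tau)
    return sigma / math.sqrt(2.0 * math.pi) * math.exp(-z * z / 2.0)


sigma = math.sqrt(50.0)
beta0, beta = 0.0, [2, 0, 2, 0, 2, 0, 0, 0, 0, 2]
tau = 0.25

# The true quantile function shifts the intercept by sigma * Phi^{-1}(tau).
true_intercept = beta0 + sigma * NormalDist().inv_cdf(tau)
risk_at_truth = check_risk(tau, true_intercept, beta, beta0, beta, sigma)
risk_shifted = check_risk(tau, true_intercept + 1.0, beta, beta0, beta, sigma)
median_risk = minimal_risk(0.5, sigma)
```

Evaluated at the true quantile function, the general expression reduces to the minimal risk, and any other intercept gives a strictly larger value.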
In terms of the risk, the realized benefit of penalization appears small compared to the previous classification example.

Fig. 3. The solution paths of the $\ell_1$-norm median regression for simulated data, plotted against $\log(\lambda)$ and $s$. The dashed lines specify the value of the regularization parameter with the minimum of 10-fold cross validated risk with respect to the check loss over the training data.

Income Data Analysis

For a real application, we take the income data in Hastie et al. (2001), which are extracted from a marketing database for a survey conducted in the Bay area (1987). The data set is available at tibs/eemstatlearn/. It consists of 14 demographic attributes with a mixture of categorical and continuous variables, which include age, gender, education, occupation, marital status, householder status (own home/rent/other), and annual income among others. The main goal of the analysis is to predict the annual income of the household (or personal income if single) from the other 13 demographic attributes. The original response of the annual income takes one of the following income brackets: <10, [10, 15), [15, 20), [20, 25), [25, 30), [30, 40), [40, 50), [50, 75), and ≥75 in units of \$1,000. For simplification, we created a proxy numerical response by converting each bracket into its middle value except the first and the last ones, which were mapped to some reasonable, albeit arbitrary, values. Removing the records with missing values yields a total of 6,876 records. Because of the granularity in the response, the normal-theory regression would not be appropriate. As an alternative, we considered median regression, in particular, the $\ell_1$-norm median regression for simultaneous variable selection and prediction.

In the analysis, each categorical variable with $k$ categories was coded by $(k-1)$ 0-1 dummy variables with the majority category treated as the baseline. Some genuinely numerical but bracketed predictors such as age were also coded in the same way as the response. As a result, 35 variables were generated from the 13 original variables. The data set was split into a training set of 2,000 observations and a test set of 4,876 for evaluation. All the predictors were centered to zero and scaled to have the squared norm equal to the training sample size before fitting models. Inspection of the marginal associations of the original attributes with the response necessitated inclusion of a quadratic term for age. We then considered linear median regression with the main effect terms only (35 variables plus the quadratic term) and with two-way interaction terms as well as the main effects.
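The two preprocessing steps described above, dummy coding against the majority category and scaling each predictor to squared norm $n$, can be sketched as follows. The helper names `dummy_code` and `standardize` and the toy column are hypothetical, and the actual coding used for the income data may differ in details such as the ordering of levels.

```python
import math
from collections import Counter


def dummy_code(values):
    """Code a categorical column by (k - 1) 0-1 dummies, with the most
    frequent category as the baseline (levels sorted for reproducibility)."""
    counts = Counter(values)
    baseline = counts.most_common(1)[0][0]
    levels = sorted(v for v in counts if v != baseline)
    rows = [[1.0 if v == lev else 0.0 for lev in levels] for v in values]
    return baseline, levels, rows


def standardize(column):
    """Center a column to mean zero and scale it so that its squared
    norm equals the sample size (assumes the column is not constant)."""
    n = len(column)
    mean = sum(column) / n
    centered = [v - mean for v in column]
    scale = math.sqrt(n / sum(v * v for v in centered))
    return [v * scale for v in centered]


# Toy column (made up) mimicking the householder status attribute.
status = ["rent", "own", "own", "other", "own", "rent"]
baseline, levels, rows = dummy_code(status)
rent_dummy = [row[levels.index("rent")] for row in rows]
rent_scaled = standardize(rent_dummy)
```

Scaling every column to the same squared norm puts the dummy variables and the numerical predictors on a common footing before the $\ell_1$ penalty is applied.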
There are potentially 53 two-way interaction terms by taking the product of each pair of the normalized main effect terms from different attributes. In an attempt to exclude nearly constant terms, we screened out any product with the relative frequency of its mode 90% or above. This resulted in addition of 69 two-way interactions to the main effects model. Note that the interaction terms were put in the partial two-way interaction model without further centering and normalization for the clarity of the model. Approximately three quarters of the interactions had their norms within 10% difference from that of the main effects.

Figure 5 shows the coefficient paths of the main effects model in the left panel and the partial two-way interaction model in the right panel for the training data set. The coefficients of the dummy variables grouped for each categorical variable are of the same color. In both models, several variables emerge at early stages as important predictors of the household income and remain important throughout the paths. Among those, the factors positively associated with household income are home ownership (in dark blue relative to renting), education (in brown), dual income due to marriage (in purple relative to "not married"), age (in sky blue), and being male (in light green). Marital status and occupation are also strong predictors. As opposed to those positive factors, being single or divorced (in red relative to "married") and being a student, clerical worker, retired or unemployed (in green relative to professionals/managers) are negatively associated with the income. So is the quadratic term of age in blue as expected.

In general, it would be too simplistic to assume that the demographic factors in the data affect the household income additively. Truthful models would need to take into account some high order interactions, reflecting
