Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Journal of Machine Learning Research 3 (2003) 1333-1356. Submitted 5/02; Published 3/03.

Simon Perkins (S.PERKINS@LANL.GOV)
Space and Remote Sensing Sciences, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

Kevin Lacker (LACKER@EECS.BERKELEY.EDU)
Department of Computer Science, University of California, Berkeley CA 94720, USA

James Theiler (JT@LANL.GOV)
Space and Remote Sensing Sciences, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

Editors: Isabelle Guyon and André Elisseeff

Abstract

We present a novel and flexible approach to the problem of feature selection, called grafting. Rather than considering feature selection as separate from learning, grafting treats the selection of suitable features as an integral part of learning a predictor in a regularized learning framework. To make this regularized learning process sufficiently fast for large scale problems, grafting operates in an incremental iterative fashion, gradually building up a feature set while training a predictor model using gradient descent. At each iteration, a fast gradient-based heuristic is used to quickly assess which feature is most likely to improve the existing model; that feature is then added to the model, and the model is incrementally optimized using gradient descent. The algorithm scales linearly with the number of data points and at most quadratically with the number of features. Grafting can be used with a variety of predictor model classes, both linear and non-linear, and can be used for both classification and regression. Experiments are reported here on a variant of grafting for classification, using both linear and non-linear models, and using a logistic regression-inspired loss function. Results on a variety of synthetic and real world data sets are presented. Finally the relationship between grafting, stagewise additive modelling, and boosting is explored.
Keywords: Feature selection, functional gradient descent, loss functions, margin space, boosting.

© 2003 Simon Perkins, Kevin Lacker and James Theiler.

1. Introduction

Systems for performing automated feature selection have long occupied a strange position, acting as a bridge between the harsh reality of the real world, and the cozy idealistic environments inhabited by most machine learning algorithms. No wonder then, that feature selection is often seen as something rather separate from learning, and altogether much more ad hoc and mysterious. As a result, most feature selection methods are rather independent of the learning systems they work with. Filter methods, for example the RELIEF algorithm (Kira and Rendell, 1992), use a quickly computed heuristic to estimate the value of each feature, individually or in combination, and use this to select a set of features before the underlying learning engine ever sees the data. Wrapper methods (Kohavi and John, 1997), do at least interact with an underlying learning engine, but they typically only communicate with it through brief summaries of performance, e.g. cross-validation scores. All the other information that the learning system might have gleaned from the data is usually ignored when choosing features.

More recently however, great efforts have been made to expand the applicability and robustness of general learning engines, and as a result, the distinction between feature selection and learning is beginning to look a little artificial. After all, are they not just two sides of the same overall task of learning a good model, given a set of training data described using a large number of features, and without using any special domain knowledge?

This observation motivates the approach presented in this paper, where we view feature selection, whether for performance or pragmatic purposes, as part of an integrated criterion optimization scheme. Section 2 of this paper presents the regularized risk minimization framework that forms the core of this scheme, and Section 3 introduces a fast, incremental method for finding optimal (or sometimes approximately optimal) solutions, which we call grafting (footnote 1). Section 4 provides an empirical comparison of grafting with a number of other learning and feature selection techniques, and finally, Section 5 draws conclusions and contrasts grafting with related work in stagewise additive modelling and boosting.

2. Learning to Select Features

In this paper, we view feature selection as just one aspect of the single problem of learning a mapping based on training data described by a large number of features. At the core of this view is a common modern approach to machine learning, which can be described as regularized risk minimization.
In the rest of this section, we review this approach, consider how it can be adapted to include feature selection, and explain why this might be a good idea.

2.1 Learning as Regularized Risk Minimization

First, a few definitions. We assume that we are trying to find a predictor function $f(\cdot)$ that maps feature (footnote 2) vectors $x$ of fixed length $n$ onto a scalar output value. If we have a binary classification problem then we derive an output label $y \in \{-1, +1\}$ with $y = \mathrm{sign}(f(x))$. If we have a regression problem then we produce an output predicted value $y = f(x)$. Since this is a machine learning paper, $f(\cdot)$ is derived, via a learning procedure, from a training set consisting of $m$ randomly sampled $(x, y)$ pairs drawn from the distribution we are attempting to model.

We assume that $f(\cdot)$ is a member of a family of predictor functions that are parameterized by a set of parameters $\theta$. We can specify a particular member of this family explicitly as $f_\theta(\cdot)$, but we will often omit the subscript for brevity. We want to come up with a $\theta$ that minimizes the expected risk:

$$R(\theta) = \int L(f_\theta(x), y)\, p(x, y)\, dx\, dy \tag{1}$$

where $L(f(x), y)$ is a loss function that specifies how much we are penalized for returning $f(x)$ when the true target value is $y$; and $p(x, y)$ is the joint probability density function for $x$ and $y$. For classification problems the most common loss function is the misclassification rate: $L = \frac{1}{2} |y - \mathrm{sign}(f(x))|$.

1. The name grafting is derived from gradient feature testing, for reasons that will become clear.
2. The term feature rather than variable is used throughout this paper to cover the general case where the arguments of $f(\cdot)$ are themselves functions of the raw variables describing the problem.

In general we do not know $p(x, y)$ in (1), so it is usually not possible to directly optimize that criterion with respect to $\theta$. Instead, we usually work with the empirical risk $R_{emp}$ calculated from the training data, with the integral in (1) replaced by a sum over all data points. As is well known, directly optimizing the empirical risk can lead to overfitting, so it is common in modern machine learning to attempt to minimize a combination of the empirical risk plus a regularization term to penalize over-complex solutions. That is, we attempt to minimize a criterion of the form:

$$C(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f_\theta(x_i), y_i) + \Omega(\theta) \tag{2}$$

where $\Omega(\theta)$ is a regularization function that has a high value for complex predictors $f$.

2.1.1 LOSS FUNCTIONS FOR CLASSIFICATION

In theory, we could simply use the error rate as the loss function in (2), but there are two main problems with this. First, this loss function almost inevitably leads to an optimization problem that is hard to solve exactly. Second, experience and theory have shown that we obtain robust and better generalizing classifiers if we prefer classifiers that separate the data by as wide a margin as possible. Discussion of this phenomenon, and many examples, can be found in Smola et al. (2000). We can usually improve generalization performance and make the criterion easier to optimize by choosing a loss function that encourages such large-margin solutions.

We define the margin for a classifier $f$ on a single data point $x$ with true label $y \in \{-1, +1\}$ to be $\rho = y f(x)$. The margin is positive if the point is correctly classified (the sign of $f(x)$ agrees with the sign of $y$), and negative otherwise.
Many commonly used loss functions can be conveniently defined in terms of this margin. Figure 1 illustrates a few of these. In our classification work, we use the Binomial Negative Log Likelihood loss function (Hastie et al., 2001, p. 308):

$$L_{bnll} = \ln(1 + e^{-\rho})$$

This loss function is derived from a model that treats $f(x)$ as the log of the ratio of the probability that $y = +1$ to the probability that $y = -1$. The main value of this assumption is that it allows us to calculate $p(x) \equiv p(y = +1 \mid x)$ from $f(x)$ using the following relation:

$$p(x) = \frac{e^{f(x)}}{1 + e^{f(x)}}$$

This loss function can also be readily generalized to a multi-class classification problem using the ideas of multi-class logistic regression (Hastie et al., 2001, pp. 95-100). That reference also contains a derivation of the loss function.

$L_{bnll}$ defines the BNLL loss for a single data point. It is also useful to define the total loss over the training set, also known as the empirical risk, which is the first term in (2):

$$L_{BNLL} = \frac{1}{m} \sum_{i=1}^{m} L_{bnll}(\rho_i), \qquad \rho_i = y_i f(x_i)$$
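The BNLL definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the use of NumPy's `logaddexp` for numerical stability is an implementation choice that the paper does not discuss.

```python
import numpy as np

def bnll_loss(margin):
    """Per-point BNLL loss ln(1 + exp(-margin))."""
    # logaddexp(0, -rho) = ln(e^0 + e^(-rho)); avoids overflow for large |rho|.
    return np.logaddexp(0.0, -np.asarray(margin, dtype=float))

def prob_positive(f_x):
    """p(y = +1 | x) = e^f / (1 + e^f), rewritten as exp(-ln(1 + e^(-f)))."""
    return np.exp(-np.logaddexp(0.0, -np.asarray(f_x, dtype=float)))

def total_bnll(f_x, y):
    """Empirical risk L_BNLL: mean per-point loss over the training set."""
    margins = np.asarray(y, dtype=float) * np.asarray(f_x, dtype=float)
    return float(np.mean(bnll_loss(margins)))
```

For a point with true label +1 the loss equals the negative log of `prob_positive(f_x)`, which is the sense in which the loss is a negative log likelihood.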

[Figure 1 plot, titled "Example Loss Functions": curves for Error Rate, SVM, Perceptron and Binomial Negative Log Likelihood; vertical axis: Loss; horizontal axis: margin $\rho$.]

Figure 1: Commonly used loss functions, plotted as a function of the margin $\rho = y f(x)$. Shown are the SVM loss function: $L_{svm} = \max(0, 1 - \rho)$; the perceptron criterion: $L_{per} = \max(0, -\rho)$; and the binomial negative log-likelihood: $L_{bnll} = \ln(1 + e^{-\rho})$.

2.1.2 REGULARIZATION

Possibly the most straightforward approach to machine learning involves defining a family of models and then selecting the model that minimizes the empirical risk. While this is certainly an often-used technique, it has two serious problems.

The first problem is that for certain combinations of model family, empirical risk function, and training data, the optimization problem can be unbounded with respect to the model parameters $\theta$. For example, consider the linear model defined by:

$$f(x) = \sum_{i=1}^{n} w_i x_i + b \tag{3}$$

where $w = (w_1, \ldots, w_n)^T$ is a vector of weights, $x_i$ is the $i$th feature of the feature vector $x$, and $b$ is a constant offset. If the training data is linearly separable, then we can increase the magnitude of $w$ indefinitely to reduce $L_{BNLL}$.

The second problem is the well-known overfitting problem. Given a sufficiently flexible family of classifiers, it is often possible to find one that has a very low empirical risk, but that generalizes very badly to previously unseen data.

Both these problems can be tackled by adding a regularization term to the empirical risk. The regularization term (or regularizer) is a function of the model parameters that returns a high value for unlikely or complex models that are liable to generalize badly. By optimizing a sum of the regularizer and the empirical risk, we achieve a trade-off between model simplicity and empirical risk. If the balance is chosen appropriately then we can often improve generalization performance significantly compared with simple empirical risk minimization.
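The margin losses of Figure 1 and the unboundedness problem for the linear model (3) can be illustrated with a small sketch. The toy data set and function names below are assumptions made purely for illustration; they do not come from the paper.

```python
import numpy as np

def svm_loss(rho):
    return np.maximum(0.0, 1.0 - rho)        # hinge loss max(0, 1 - rho)

def perceptron_loss(rho):
    return np.maximum(0.0, -rho)             # perceptron criterion max(0, -rho)

def bnll_loss(rho):
    return np.logaddexp(0.0, -rho)           # ln(1 + e^(-rho))

# Hypothetical 1-D training set that is linearly separable at x = 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def mean_bnll(w, b=0.0):
    """L_BNLL for the one-feature linear model f(x) = w * x + b."""
    return float(np.mean(bnll_loss(y * (w * x + b))))

# Scaling up w keeps every margin positive and keeps shrinking the loss,
# so the unregularized problem has no finite minimizer.
losses = [mean_bnll(w) for w in (1.0, 10.0, 100.0)]
```

Each entry of `losses` is smaller than the one before it but still positive, which is the behaviour the text describes: on separable data the weight magnitude can be increased indefinitely.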
The form of the regularizer depends to some extent on the form of the model, but here we restrict ourselves to a class of models where the model parameters $\theta$ take the form of a vector of real-valued numbers of length $p$, which we will refer to as a weight vector $w$. This class of models includes linear models and many types of multi-layer perceptron (MLP).

Given this parameter vector, we can define a commonly employed family of regularizers parameterized by a non-negative integer $q$, and a vector of positive real numbers $\alpha$:

$$\Omega_q(w) = \lambda \sum_{i=1}^{p} \alpha_i |w_i|^q \tag{4}$$

Members of this regularizer family correspond to different kinds of weighted Minkowski norm of the parameter vector, and so $\Omega_q$ is often referred to as an $l_q$ regularizer. Usually, we choose $\alpha_i \in \{0, 1\}$ so as to simply include or exclude certain elements of $w$ from the regularization (footnote 3). The essence of these regularizers is that they penalize large values of $w_i$ when $\alpha_i > 0$.

It is easy to show that the solutions found by unconstrained minimization of (2) using an $\Omega_q$ regularizer are equivalent to those found by the following constrained optimization problem:

$$\text{minimize } \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i) \quad \text{s.t. } \sum_{i=1}^{p} \alpha_i |w_i|^q \le \gamma \tag{5}$$

There is a one-to-one (but not necessarily simple) correspondence between the parameters $\lambda$ and $\gamma$. This alternative formulation is useful when we consider the $\Omega_0$ regularizer.

The most interesting members of the family involve $q \in \{0, 1, 2\}$. The following is a summary of the properties and peculiarities of these three regularizers.

$\Omega_2$: This regularizer is seen in ridge regression (Hoerl and Kennard, 1970), the support vector machine (Boser et al., 1992, Schölkopf and Smola, 2002) and regularization networks (Girosi et al., 1995). Those references give various justifications for this form of regularization. One reason for preferring it over other $\Omega_q$ regularizers is that it is the only one which produces the same solution under an arbitrary rotation of the feature space axes. The $l_2$ norm also makes the solution to (2) bounded. As $\lambda$ is increased, the magnitudes of the elements of $w$ will tend to decrease, but in general none will go to zero. The $l_2$ norm is a convex function of weights, and so if the loss function being optimized is also a convex function of the weights, then the regularized loss has a single local (and global) optimum.
$\Omega_1$: The $l_1$-based regularizer is also known as the lasso. Tibshirani (1994) describes it in great detail and notes that one of its main advantages is that it often leads to solutions where some elements of $w$ are exactly zero. As $\lambda$ is increased, the number of zero weights steadily increases. Unlike the $\Omega_2$ regularizer, using the $l_1$ norm means that an arbitrary rotation of the feature axes in general produces a different solution, so the feature axes have special status with this choice of regularizer. Like the $\Omega_2$ regularizer, this regularizer leads to bounded solutions. Similarly, the $l_1$ norm is a convex function of weights, and so if the loss function being optimized is also a convex function of the weights, then the regularized loss has a single local (and global) optimum.

$\Omega_0$: If we define $0^0 \equiv 0$, then this contributes a fixed penalty $\alpha_i$ for each weight $w_i \neq 0$. If all $\alpha_i$ are identical, then this is equivalent to setting a limit on the maximum number of non-zero weights. The $l_0$ norm is, however, a non-convex function of weights, and this tends to make exact optimization of (2) computationally expensive.

3. For instance in a linear model we usually want to exclude the constant offset term.
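The regularizer family (4) and the three special cases summarized above can be sketched as follows. The function name `omega_q` and the example vectors are hypothetical; the $q = 0$ branch implements the convention $0^0 \equiv 0$ used in the text.

```python
import numpy as np

def omega_q(w, q, lam=1.0, alpha=None):
    """Weighted l_q regularizer: lam * sum_i alpha_i * |w_i|^q."""
    w = np.asarray(w, dtype=float)
    alpha = np.ones_like(w) if alpha is None else np.asarray(alpha, dtype=float)
    if q == 0:
        powered = (w != 0).astype(float)     # convention 0^0 = 0: count non-zeros
    else:
        powered = np.abs(w) ** q
    return float(lam * np.sum(alpha * powered))

# Rotating the axes by 45 degrees leaves the l_2 penalty unchanged,
# but changes the l_1 penalty.
theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
w = np.array([1.0, 0.0])
w_rotated = rotation @ w

l2_before, l2_after = omega_q(w, 2), omega_q(w_rotated, 2)
l1_before, l1_after = omega_q(w, 1), omega_q(w_rotated, 1)
```

Setting an entry of `alpha` to zero excludes the corresponding weight from the penalty, which is how the constant offset of a linear model would typically be left unregularized.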

2.2 Feature Selection as Regularization

If we have a mathematical expression for a model in which feature vector elements only ever appear with an associated multiplicative weight, then the process of feature selection amounts to producing a model in which only a subset of weights associated with features are non-zero. Of the regularizers described above, $\Omega_0$ and $\Omega_1$ lead to solutions with some weights set to exactly zero. But can they be justified in terms of the standard reasons for feature selection? There are many motivations for feature selection, but we will consider two broad classes which generally encompass the reasons most commonly given.

2.2.1 PRAGMATIC MOTIVATIONS FOR FEATURE SELECTION

Often, the motivations for feature selection are pragmatic. We wish to reduce training time; reduce the time it takes to apply a learned model to new data; reduce the storage requirements for the model; or improve the intelligibility of the model. We can interpret all of these as either constituting a fixed penalty for including a feature in the model, or as a constraint on the maximum number of features in the model.

For the simplest linear model case, where each feature appears in the model associated with just a single weight, it is easy to see that both of these interpretations correspond to an $\Omega_0$ regularizer with $\alpha_i > 0$ only where $w_i$ is the multiplicative weight on a feature. For more complex models, where each feature might be associated with several weights, we can handle this with a slightly modified version of the $\Omega_0$ regularizer:

$$\Omega_0(w) = \sum_{i=1}^{n} \alpha_i \delta_i \tag{6}$$

where $\delta_i = \max_{j \in s_i} \left( |w_j|^0 \right)$, in which $s_i$ is the set of weight indices associated with the $i$th feature. If different features carry different costs, for instance if some features are very expensive to compute, then we can adjust the $\alpha_i$ associated with those features accordingly.
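The modified regularizer (6) can be sketched as below. The representation of the index sets $s_i$ as a list of integer lists, the function name, and the example costs are all assumptions made for illustration.

```python
import numpy as np

def omega0_grouped(w, groups, alpha):
    """Per-feature l_0 penalty: sum_i alpha_i * delta_i.

    groups[i] lists the weight indices s_i associated with feature i;
    delta_i is 1 if any of those weights is non-zero, and 0 otherwise.
    """
    w = np.asarray(w, dtype=float)
    total = 0.0
    for indices, cost in zip(groups, alpha):
        delta = 1.0 if np.any(w[indices] != 0) else 0.0
        total += cost * delta
    return total

# Hypothetical model: four features, the first two feeding two weights each.
w = np.array([0.0, 0.0, 0.3, 0.0, -1.0, 0.0])
groups = [[0, 1], [2, 3], [4], [5]]
alpha = [1.0, 1.0, 5.0, 1.0]     # the third feature is "expensive to compute"
penalty = omega0_grouped(w, groups, alpha)
```

Here the second and third features are in use, so the penalty is the sum of their costs; a feature is charged once no matter how many of its weights are non-zero.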
2.2.2 PERFORMANCE MOTIVATIONS FOR FEATURE SELECTION

The other common motivation for feature selection is to improve the generalization performance of our learned models. In general, the more feature dimensions a model includes, the greater its capacity, and hence the greater the tendency for it to overfit the training data (the so-called curse of dimensionality). But regularization techniques are intended to prevent overfitting, and the question arises: if we use regularization, do we need to do any additional feature selection? Or alternatively, can we achieve improved generalization performance by using a regularizer that encourages zero-weighted features in our model, such as the $\Omega_1$ and $\Omega_0$ regularizers?

In order to explore this issue, we performed a simple experiment to compare the generalization performance of the same simple linear classifier using $\Omega_0$, $\Omega_1$ and $\Omega_2$ regularizers, in the presence of varying numbers of irrelevant features.

We created a sequence of simple $n$-feature two-class problems as follows. For the first class, the $n$ features for each training example are drawn independently from a normal distribution with mean equal to $-1$ and standard deviation $\sigma$. The $n$ features of the second class are generated in the same way except a normal distribution with a mean of $+1$ is used. We then randomly permute the feature values between all the training examples for all elements of the feature vector except the first two features. This produces a training set where each feature has exactly the same distribution, but only the first two are correlated with the class label and the other $n - 2$ features are irrelevant.

[Figure 2 plots: two panels, $\sigma = 0.5$ and $\sigma = 1.0$; curves for $\Omega_0$, $\Omega_1$, $\Omega_2$, Unregularized and Bayes Error; vertical axis: Mean misclassification rate; horizontal axis: Irrelevant features (0 to 12).]

Figure 2: Comparison of different regularization schemes on a problem with varying numbers of irrelevant features and different optimal Bayes error.

Training sets with between 0 and 12 irrelevant features were generated, each containing 10 samples drawn from each class. We compared problems with $\sigma = 0.5$ and $\sigma = 1.0$. The former has a Bayes misclassification rate of 0.0035, the latter has a Bayes misclassification rate of 0.079. Test sets were generated in the same way, but with 1000 samples in each class.

We used a linear model as in (3) which was trained by optimizing a regularized risk criterion (2), using the logistic regression loss function $L_{bnll}$ and one of the three $\Omega_q$ regularizers, or using no regularization. For $\Omega_1$ and $\Omega_2$ we used a gradient descent algorithm (Nelder and Mead, 1965), and for $\Omega_0$ we used backward elimination (Kohavi and John, 1997), choosing to eliminate the feature that increased the loss function by the least at each step. This is a greedy procedure that may not produce the optimal answer, but exhaustive subset comparison was impractically slow. The regularization parameters ($\lambda$ for the $\Omega_1$ and $\Omega_2$ regularizers, $\gamma$ for the $\Omega_0$ regularizer) were found by generating multiple instances of each problem and searching for the values that minimized the average misclassification rate.

Figure 2 compares the performance of the various regularizers as the number of irrelevant features is altered. For each problem type, 200 training and test sets were randomly generated. The plots show the mean test score for each type of regularizer, and for the two different values of $\sigma$.
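The synthetic problem described above can be generated with a short sketch like the following. The function name, argument names, and the choice of NumPy's `Generator` API are assumptions; only the sampling and permutation steps come from the text.

```python
import numpy as np

def make_problem(n_features, n_per_class, sigma, rng):
    """Two-class data in which only the first two features are informative."""
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    # Every feature is drawn from N(-1, sigma) for class -1 and N(+1, sigma)
    # for class +1.
    X = rng.normal(loc=y[:, None], scale=sigma,
                   size=(2 * n_per_class, n_features))
    # Shuffle each remaining column across all examples: its marginal
    # distribution is unchanged, but its link to the class label is destroyed.
    for j in range(2, n_features):
        X[:, j] = rng.permutation(X[:, j])
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_problem(n_features=14, n_per_class=10, sigma=1.0, rng=rng)
X_test, y_test = make_problem(n_features=14, n_per_class=1000, sigma=1.0, rng=rng)
```

With `n_features=14` there are 12 irrelevant features, the largest setting in the experiment; the training and test sizes match those quoted in the text.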
For reasons of clarity, error bars indicating the standard error of the mean misclassification rate are not shown here, but they are small compared to the separation between the curves.

Both values of $\sigma$ produce qualitatively similar results. The unregularized and subset selection ($\Omega_0$) experiments perform worst, although subset selection does relatively better when the informative features are well-separated in the low $\sigma$ case. Of the other two experiments, when most features are relevant, then $\Omega_2$ regularization slightly outperforms $\Omega_1$ regularization. But as the number of irrelevant features increases, $\Omega_1$ regularization takes the lead. Interestingly, when more than two-thirds of the features are irrelevant in this case, the test performance using $\Omega_1$ regularization seems to level off, while the performance using $\Omega_2$ regularization continues to degrade.

In conclusion, if performance alone is the key concern, then either $\Omega_1$ or $\Omega_2$, rather than $\Omega_0$, seem to be the preferred regularizers. If we expect a large fraction of irrelevant features, then we might prefer the $\Omega_1$ regularizer. These conclusions are somewhat similar to those reached by Tibshirani (1994).

| Criterion | $\Omega_0$ | $\Omega_1$ | $\Omega_2$ |
| Models pragmatic motivations for feature selection? | Yes | No | No |
| Models performance motivations for feature selection? | No | Yes | Yes |
| Leads to sparse solutions? | Yes | Yes | No |
| Performance when most features are relevant? | OK | Good | Excellent |
| Performance when most features are irrelevant? | Poor | Good | OK |
| Convex regularizer? (doesn't add extra local optima) | No | Yes | Yes |
| Numerical friendliness | Poor | Good | Excellent |

Table 1: Comparison of three different $\Omega_q$ regularizers.

2.3 A Unified Optimization Criterion

We have argued that for a significant class of models, described by real-valued parameter vectors, many motivations for feature selection can be incorporated into a regularized risk minimization framework, using a suitable combination of $\Omega_q$ regularizers. Table 1 summarizes the different qualities of the three $\Omega_q$ regularizers we have considered. In general, we might want to use all three, which leads to the following optimization criterion:

$$C(w) = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i) + \lambda_2 \sum_{i=1}^{p} \alpha_{2,i} |w_i|^2 + \lambda_1 \sum_{i=1}^{p} \alpha_{1,i} |w_i| + \lambda_0 \sum_{i=1}^{n} \alpha_{0,i} \delta_i \tag{7}$$

3. Grafting

We wish to find a minimum of (7), with respect to our model parameters. We now consider how this might be done, and introduce the grafting algorithm, as a fast way of getting to an optimal or approximately optimal solution, if $\lambda_0$ or $\lambda_1$ is non-zero.

3.1 Direct Gradient Descent Optimization

Probably the most direct method of solution of (7) is to perform gradient descent with respect to the model parameters $w$, until a minimum is found. If we can compute the gradient of the loss function and the regularization term(s) with respect to these parameters, then we can use conjugate gradient or quasi-Newton methods. If not, then we can use a minimization method that doesn't require gradient information, such as Powell's direction set method. See Press et al. (1992, chap. 10) for overviews of these methods.

Unfortunately there are a number of problems with this approach. Firstly, the gradient descent can be quite slow.
Algorithms such as conjugate gradient descent typically scale quadratically with the number of dimensions (footnote 4), and we are particularly interested in the domain where we have many features and so $p$ is large. This quadratic dependence on number of model weights seems particularly wasteful if we are using the $\Omega_0$ and/or $\Omega_1$ regularizer, and we know that only some subset of those weights will be non-zero.

The second problem is that the $\Omega_0$ and $\Omega_1$ regularizers do not have continuous first derivatives with respect to the model weights. This can cause numerical problems for general purpose gradient descent algorithms that expect these derivatives to be continuous.

4. Conjugate gradient descent requires only $O(p)$ line minimizations, but the gradient calculation required to determine the direction for each minimization is also typically $O(p)$, giving total complexity that is closer to $O(p^2)$ for large $n$.

Finally, we have the problem of local minima. The $\Omega_2$ and $\Omega_1$ regularizers are convex functions of the weights and so if the loss function being used is also a convex function of weights, then we have a single optimum. The $\Omega_0$ regularizer on the other hand is not convex and introduces many local minima into the optimization problem.

3.2 Stagewise Gradient Descent Optimization

If we are using the $\Omega_1$ or $\Omega_0$ regularizer, and we suspect that the number $N$ of non-zero weights in the final model is going to be much less than the total number of weights $n$, then a more efficient stagewise optimization procedure suggests itself. We call this algorithm grafting.

The basic plan is to begin with a model in which almost all weights are at zero. At each iteration of the grafting procedure, we use a fast gradient-based heuristic to decide which zero weight should be adjusted away from zero in order to decrease the optimization criterion by the maximum amount. We then perform gradient descent using that weight and any other non-zero weights in the model, and continue until no further progress can be made.

3.2.1 THE BASIC GRAFTING ALGORITHM

For ease of presentation, we first consider the case where $\lambda_0 = 0$ in (7), but $\lambda_1$ and $\lambda_2$ are non-zero. We will return to the case where $\lambda_0 \neq 0$ later. At this stage, our discussion applies to a broad class of models and loss functions.

As described above, we assume that the model we are using is parameterized by a weight vector $w$. At any stage in the grafting process, the model weights are divided into two disjoint sets. Those weights $w_i \in F$ are free to be altered as desired. The remaining weights $w_i \in Z\ (\equiv \neg F)$ are fixed at zero. We also assume that the output of the model for a given training example is differentiable with respect to the model weights, i.e. we can calculate $\partial f(x_i) / \partial w_j$ for an arbitrary feature vector $x_i$ and an arbitrary weight $w_j$.
As explained below, after each grafting step (and before the first step) we minimize (7) with respect to the free weights, so before the $k$th grafting step, we have:

$$\forall w_i \in F: \quad \frac{\partial C}{\partial w_i} = 0$$

During the $k$th grafting step, we wish to move one weight from $Z$ to $F$. It seems sensible to select the weight which is going to have the greatest effect on reducing the optimization criterion $C$. The gradient of the criterion with respect to an arbitrary model weight $w_i$ is given below, where the sum runs over the $m$ training points $x_j$:

$$\frac{\partial C}{\partial w_i} = \frac{1}{m} \sum_{j=1}^{m} \frac{\partial L}{\partial f(x_j)} \frac{\partial f(x_j)}{\partial w_i} + 2 \lambda_2 \alpha_{2,i} w_i + \lambda_1 \alpha_{1,i}\, \mathrm{sign}(w_i) = \frac{1}{m} \sum_{j=1}^{m} \frac{\partial L}{\partial f(x_j)} \frac{\partial f(x_j)}{\partial w_i} \pm \lambda_1 \alpha_{1,i} \tag{8}$$

The contribution from the $\Omega_2$ term disappears because $w_i = 0$ for all $w_i \in Z$. Slightly more subtle is the replacement of $\mathrm{sign}(w_i)$ with $\pm 1$, which invites the question of what sign should be used, and whether in fact $\mathrm{sign}(0)$ has a well-defined value at all. Recall, however, that we are interested in determining which weight, when adjusted in the appropriate direction, will decrease $C$ at the fastest rate.

Consider $\partial L_{TOT} / \partial w_i$, the derivative of the total loss with respect to $w_i$ (this is just the first term in the above expression). Suppose that $\partial L_{TOT} / \partial w_i > \lambda_1 \alpha_{1,i}$. This means that $\partial C / \partial w_i > 0$, regardless of the sign of $w_i$. In this case, in order to decrease $C$, we will want to decrease $w_i$. Since $w_i$ starts at zero, the very first infinitesimal adjustment to $w_i$ will take it negative. Therefore for our purposes we can let $\mathrm{sign}(w_i) = -1$. Similarly, if $\partial L_{TOT} / \partial w_i < -\lambda_1 \alpha_{1,i}$, then we can effectively let $\mathrm{sign}(w_i) = +1$. Essentially, the effect of the $\Omega_1$ derivative is simply to reduce the magnitude of $\partial C / \partial w_i$ by an amount $\lambda_1 \alpha_{1,i}$.

The same argument shows that if $|\partial L_{TOT} / \partial w_i| < \lambda_1 \alpha_{1,i}$ then it is not possible to produce any local decrease in $C$ by adjusting $w_i$ away from zero. This is the essence of why the $\Omega_1$ regularizer leads to solutions with zero-valued weights, and also provides the basis for one of the two stopping conditions discussed below.

At each grafting step, we calculate the magnitude of $\partial C / \partial w_i$ for each $w_i \in Z$, and determine the maximum magnitude. We then add the corresponding weight to the set of free weights $F$, and call a general purpose gradient descent routine to optimize (7) with respect to all the free weights. Since we know how to calculate the gradient $\partial C / \partial w_i$, we can use an efficient quasi-Newton method. We use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) variant (see Press et al. 1992, chap. 10 for a description). We start the optimization at the weight values found in the $(k-1)$th step, so in general only the most recently added weight changes significantly.

Note that choosing the next weight based on the magnitude of $\partial C / \partial w_i$ does not guarantee that it is the best weight to add at that stage. However, it is much faster than the alternative of trying an optimization with each member of $Z$ in turn and picking the best.
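The threshold argument above can be checked numerically on a one-weight toy problem. This is a minimal sketch under stated assumptions: the data, the single-weight model $f(x) = w x$ with no offset, the BNLL loss, $\alpha_{1} = 1$ and the grid search are all illustrative choices, not part of the paper.

```python
import numpy as np

# Hypothetical 1-D data set (the last point is deliberately mislabelled).
x = np.array([1.0, 2.0, -1.0, -0.5])
y = np.array([1.0, 1.0, -1.0, 1.0])

def criterion(w, lam1):
    """C(w) = mean BNLL loss + lam1 * |w| for the model f(x) = w * x."""
    return float(np.mean(np.logaddexp(0.0, -y * w * x)) + lam1 * abs(w))

def loss_slope_at_zero():
    """dL_TOT/dw at w = 0, where each point contributes -y * x / 2."""
    return float(np.mean(-0.5 * y * x))

def best_weight_on_grid(lam1):
    """Brute-force minimizer of C over a fine grid that contains w = 0."""
    grid = np.linspace(-2.0, 2.0, 401)
    values = [criterion(w, lam1) for w in grid]
    return float(grid[int(np.argmin(values))])

slope = loss_slope_at_zero()                 # negative: increasing w helps
w_small_penalty = best_weight_on_grid(0.2)   # |slope| > lam1, weight leaves zero
w_large_penalty = best_weight_on_grid(0.6)   # |slope| < lam1, weight stays at zero
```

The quantity `abs(slope) - lam1` is what the gradient test compares across the candidates in $Z$: when it is positive the weight moves away from zero in the direction opposite to the sign of `slope`, and when it is negative the minimizer stays at zero.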
We shall see below that this procedure will take us to a solution that is at least locally optimal.

3.2.2 INCORPORATING THE $\Omega_0$ REGULARIZER

Use of the $\Omega_0$ regularizer means that transferring a weight $w_i$ from $Z$ to $F$ incurs a penalty of $\lambda_0 \alpha_{0,i} \delta_i$. This fixed penalty makes it substantially harder to determine which weight is the most promising one to transfer to $F$ (footnote 5). The heuristic we use in this case is based upon the empirical observation that in a sequence of grafting steps, the magnitude of the most recently added weight in $F$ typically decreases monotonically. This allows us to estimate an upper limit on the magnitude of the weight we are about to add, which in turn means we can roughly estimate a bound on the change in $C$ which will result from adding a weight $w_i$ in the grafting step after a weight $w_j$ was added:

$$\Delta C(w_i) \approx \lambda_0 \alpha_{0,i} \delta_i - |w_j| \left( \left| \frac{\partial L_{TOT}}{\partial w_i} \right| - \lambda_1 \alpha_{1,i} - \lambda_2 \alpha_{2,i} |w_j| \right) \tag{9}$$

Picking the best weight to add then amounts to choosing the $w_i \in Z$ that minimizes (9). Note that if $\lambda_0 = 0$ and $\alpha_{2,i} = \alpha_{2,j}$ for all $w_i, w_j \in Z$, then this heuristic is equivalent to the simple heuristic previously discussed.

3.2.3 STOPPING CONDITIONS

If only the $\Omega_2$ regularizer is being used, then in general the model will contain no zero-valued weights, and the grafting procedure will not terminate until $Z$ is empty. In this case there is no advantage in using grafting over full gradient descent optimization.

5. One exception is when all weights in $Z$ incur the same $\Omega_0$ penalty, as is the case with a simple linear model where we simply penalize the number of included features. In this case it is reasonable to use the standard heuristic.

If we are using the $\Omega_1$ regularizer, then we can reach a point where:

$$\forall w_i \in Z: \quad \left| \frac{\partial L_{TOT}}{\partial w_i} \right| \le \lambda_1 \alpha_{1,i}$$

At this point it is not possible to make any further decrease in $C$ by either moving a weight from $Z$ to $F$, or by adjusting any weights in $F$ and so we are at a local (and perhaps global) minimum, and can terminate the grafting procedure.

If we are using the $\Omega_0$ regularizer, then we may reach a point where $C$ increases after adding $w_i$ to $F$. We must then set $w_i$ to zero, remove it from $F$, and undo the last optimization step (it is convenient to keep a copy of the previous model around in order to avoid an extra optimization step). We then have a choice. It is possible that a different choice of $w_i$ might lead to a decrease in $C$, so we could try the optimization step again with the $w_i$ associated with the next lowest value of $\Delta C(w_i)$. This cycle could be repeated until all remaining weights in $Z$ have been eliminated, and the algorithm then terminates. Alternatively we can just terminate the algorithm the first time this happens, recognizing that with the $\Omega_0$ regularizer, our solution will be a greedy approximation to the optimal solution at best. The latter approach is the one we recommend in most cases.

3.2.4 OPTIMALITY

If we have a convex loss function (as a function of weights) and are using just the $\Omega_2$ and/or $\Omega_1$ regularizers (which are themselves convex functions of the weights), then there is only one minimum of (7). Examination of the stopping conditions above reveals that the grafting algorithm is guaranteed to stop at a local optimum, and so grafting is guaranteed to find the global optimum in these cases.

As we have seen by now, use of the $\Omega_0$ regularizer makes it much harder to find an optimal solution. The grafting procedure with non-zero $\lambda_0$ amounts to a greedy heuristic forward subset selection method, which sacrifices optimality in return for fast learning. Whether this is good enough for the problem at hand depends on the situation.
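Putting the gradient test and the $\Omega_1$ stopping condition together, the basic grafting loop for a linear model with the BNLL loss might be sketched as follows. Several things here are assumptions rather than the authors' method: all function names are hypothetical, $\lambda_0 = \lambda_2 = 0$ and $\alpha_{1,i} = 1$ for every feature weight, the offset is left unpenalized, and the inner optimizer is a simple proximal gradient (soft-thresholding) iteration swapped in for the BFGS routine used in the paper so that the sketch needs only NumPy.

```python
import numpy as np

def bnll_gradient(X, y, w, b):
    """Gradient of the mean BNLL loss w.r.t. the weights and the offset."""
    rho = y * (X @ w + b)                            # margins
    dL_df = -y * np.exp(-np.logaddexp(0.0, rho))     # dL/df = -y / (1 + e^rho)
    return X.T @ dL_df / len(y), float(np.mean(dL_df))

def fit_free(X, y, w, b, free, lam1, iters=2000):
    """Minimize C over the free weights (and the offset) only.

    Swapped-in optimizer: proximal gradient with soft thresholding.
    """
    m = len(y)
    # 1 / step is an upper bound on the curvature of the mean BNLL loss.
    step = 4.0 * m / np.linalg.norm(np.c_[X, np.ones(m)], 2) ** 2
    w = w.copy()
    for _ in range(iters):
        g_w, g_b = bnll_gradient(X, y, w, b)
        z = w[free] - step * g_w[free]
        w[free] = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
        b -= step * g_b
    return w, b

def graft(X, y, lam1):
    """Grow the free set F one weight at a time until the gradient test fails."""
    n = X.shape[1]
    w, b = np.zeros(n), 0.0
    free = np.zeros(n, dtype=bool)                   # F is empty, Z holds everything
    w, b = fit_free(X, y, w, b, free, lam1)          # optimize before the first step
    order = []
    while not free.all():
        g_w, _ = bnll_gradient(X, y, w, b)
        score = np.where(free, -np.inf, np.abs(g_w) - lam1)
        best = int(np.argmax(score))
        if score[best] <= 0.0:                       # Omega_1 stopping condition
            break
        free[best] = True                            # move w_best from Z to F
        order.append(best)
        w, b = fit_free(X, y, w, b, free, lam1)      # warm start from previous weights
    return w, b, order
```

`graft` returns the weights, the offset, and the order in which features were selected; weights that were never moved out of $Z$ remain exactly zero. A call such as `graft(X, y, lam1=0.1)` on standardized features is the intended usage.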
One should note, however, that as $\lambda_0$ is made smaller and smaller relative to $\lambda_1$, the chances of ending up in a sub-optimal situation decrease. Hence we are inclined to make $\lambda_0$ fairly small in most cases.

3.2.5 COMPUTATIONAL COMPLEXITY

We have claimed that grafting is substantially faster than full gradient descent. We will now examine this claim more carefully.

If there are $p$ weights in our weight vector, then full gradient descent requires some multiple of $p$ line minimizations to optimize our criterion, let's say $cp$ minimizations.⁶ Determining the direction requires $p$ derivatives $\partial C / \partial w_i$ to be computed. The computation of each derivative is dominated by the computation of $\partial L_{TOT} / \partial w_i$, which is simply a weighted sum of $m$ simple derivatives $\partial f(x_j) / \partial w_i$. The line minimizations themselves require a few $O(m)$ function evaluations, but if $p$ is large, then this is a minor contribution. If we denote the time taken to calculate one simple derivative as $\tau$, then the total time taken for full gradient descent is $c m p^2 \tau$.

6. Here, $c = 1$ if our criterion is a perfect quadratic form, and $c > 1$ otherwise.

Under grafting we will select some number $s$ of weights before the algorithm terminates. Since we select one weight at each grafting step, we take $s$ steps. The $k$-th step consists of two phases. First we evaluate $\partial C / \partial w_i$ for each of the $(p - k)$ weights in $Z$. As noted above, each derivative calculation takes $m\tau$, and so the time devoted to gradient testing over $s$ steps is approximately $m s p \tau$. The second phase involves optimizing with respect to the $k$ free weights. At most this might take $ck$ line minimizations, but it should take less than this since most of the free weights will be close to their optimal values. To compensate for using a constant $c$ that is probably too high here, we again ignore the time taken for the line minimizations themselves (some small multiple of $m\tau$). Therefore the time taken for optimization at the $k$-th step is $c m k^2 \tau$. Over $s$ steps, this is approximately $\frac{1}{3} c m s^3 \tau$.

Putting this together, the total grafting run time is $(m s p + \frac{1}{3} c m s^3)\tau$. If we assume that $s \ll p$, then it is clear that the grafting algorithm should be substantially quicker than the $c m p^2 \tau$ required for full gradient descent.

Also note that the full gradient descent algorithm has to deal with discontinuities in the gradient, which can slow it down significantly. By keeping zero-valued weights out of the optimization steps, grafting avoids this difficulty.

3.2.6 NORMALIZATION

In order to make the gradient magnitude heuristic a fair comparison, it is usually important to normalize all features so that they have approximately the same scale. Before we begin, we linearly scale all feature vectors so that each feature has a mean value of zero and a standard deviation of one. It is of course important to scale the testing data using the same scaling parameters derived from the training data.

3.3 Grafting Examples

It is helpful to illustrate the grafting algorithm in more detail for some particular models and loss functions. Here we will concern ourselves only with binary classification problems, and so a suitable loss function to use is the binomial negative log likelihood $L_{bnll}$. For simplicity, we will also assume that only $\Omega_1$ and $\Omega_0$ regularization are used, and that all $\alpha_{1,i} \in \{0, 1\}$.
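To make the loss concrete, the following Python sketch evaluates the binomial negative log likelihood and its derivative with respect to each margin. It assumes the form $L_{bnll} = \frac{1}{m} \sum_i \ln(1 + e^{-\rho_i})$ with margins $\rho_i = y_i f(x_i)$. The function names are hypothetical, and the branching on the sign of $\rho_i$ is only a standard guard against floating-point overflow, not part of the method itself.

```python
import math
from typing import List, Sequence


def softplus_neg(rho: float) -> float:
    """Return ln(1 + exp(-rho)) without overflowing for large |rho|."""
    if rho >= 0.0:
        return math.log1p(math.exp(-rho))
    return -rho + math.log1p(math.exp(rho))


def bnll_loss(margins: Sequence[float]) -> float:
    """Mean binomial negative log likelihood over the margins rho_i."""
    return sum(softplus_neg(r) for r in margins) / len(margins)


def bnll_margin_gradient(margins: Sequence[float]) -> List[float]:
    """dL/drho_i = -(1/m) / (1 + exp(rho_i)) for each training point."""
    m = len(margins)
    grad = []
    for r in margins:
        if r >= 0.0:
            e = math.exp(-r)  # 1/(1+e^r) rewritten as e^-r/(1+e^-r)
            grad.append(-(e / (1.0 + e)) / m)
        else:
            grad.append(-(1.0 / (1.0 + math.exp(r))) / m)
    return grad
```

Points with large positive margins contribute almost nothing to either the loss or its gradient, while badly misclassified points contribute a loss that grows roughly linearly with the size of the violation.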
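3.3.1 LINEAR MODEL

We first consider linear models with $n + 1$ weights, of the form:

$$f(x) = \sum_{i=1}^{n} w_i x_i + w_0$$

If we define the margin for a given training pair $(x_i, y_i)$ as $\rho_i = y_i f(x_i)$, then the following is the regularized optimization criterion:

$$C(w) = \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + e^{-\rho_i}\right) + \lambda_1 \sum_{i=1}^{n} |w_i| + \lambda_0 s \qquad (10)$$

where $s$ is the number of selected features. Note that the constant offset term $w_0$ does not appear in the regularizer, since we do not want to penalize a mere translation of the linear discriminant surface. The derivatives we need (ignoring the $\Omega_0$ term for now) are:

$$\frac{\partial C}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L_{bnll}}{\partial \rho_i} \frac{\partial \rho_i}{\partial w_j} \pm \lambda_1 = -\frac{1}{m} \sum_{i=1}^{m} \frac{1}{1 + e^{\rho_i}} \, y_i x_{i,j} \pm \lambda_1$$

where $x_{i,j}$ is the $j$-th component of $x_i$, and the sign of the regularizer term follows the sign of $w_j$.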
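The gradient test for the linear model can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Matlab code used in the experiments: the function names are hypothetical, the data are plain lists of floats, only $\Omega_1$ regularization is considered, and a zero-valued weight is treated as a candidate only if its loss-gradient magnitude exceeds $\lambda_1$.

```python
import math
from typing import List, Optional, Sequence


def inv_one_plus_exp(rho: float) -> float:
    """Return 1 / (1 + exp(rho)), guarded against overflow."""
    if rho >= 0.0:
        e = math.exp(-rho)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(rho))


def loss_gradients(X: Sequence[Sequence[float]], y: Sequence[float],
                   w: Sequence[float], w0: float) -> List[float]:
    """dL/dw_j = -(1/m) * sum_i y_i x_ij / (1 + exp(rho_i)) for all j."""
    m, n = len(X), len(w)
    grads = [0.0] * n
    for x_i, y_i in zip(X, y):
        rho = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + w0)
        scale = y_i * inv_one_plus_exp(rho) / m
        for j in range(n):
            grads[j] -= scale * x_i[j]
    return grads


def pick_feature(X: Sequence[Sequence[float]], y: Sequence[float],
                 w: Sequence[float], w0: float,
                 lam1: float) -> Optional[int]:
    """Index of the zero weight with the largest gradient, or None."""
    grads = loss_gradients(X, y, w, w0)
    candidates = [j for j, wj in enumerate(w) if wj == 0.0]
    if not candidates:
        return None
    best = max(candidates, key=lambda j: abs(grads[j]))
    # The L1 penalty opposes the move, so the loss gradient must beat lam1.
    return best if abs(grads[best]) > lam1 else None
```

On a toy problem in which the first feature equals the label and the second is uncorrelated with it, `pick_feature` selects the first feature for small $\lambda_1$ and returns `None` once $\lambda_1$ exceeds the gradient magnitude.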

It is instructive to interpret this derivative in a geometric way. First, we imagine an $m$-dimensional margin space, which has one dimension for each training point. If we think of each of the coordinate axes as representing the margin $\rho_i$ for the current model on the point $x_i$, then the total loss function can be thought of as a function over this space, and we can calculate the full gradient of that function:

$$\nabla_\rho L = \left( \frac{\partial L}{\partial \rho_1}, \frac{\partial L}{\partial \rho_2}, \ldots, \frac{\partial L}{\partial \rho_m} \right)^T$$

In the same margin space we can also imagine the feature margin vector $r_j$:

$$r_j = (y_1 x_{1,j}, \, y_2 x_{2,j}, \, \ldots, \, y_m x_{m,j})^T$$

Given this we can write:

$$\frac{\partial C}{\partial w_j} = \nabla_\rho L \cdot r_j \pm \lambda_1$$

Since the grafting heuristic for selecting the next weight to add to the model is only interested in the magnitude of this derivative, and since the regularizer component always acts to reduce this magnitude, we can see that picking the next weight amounts to choosing the feature margin vector that is best aligned with the direction of steepest descent of the loss function in margin space.

For the linear model, we initialize $F$ to contain just $w_0$ and perform an initial optimization. We then proceed in the usual grafting fashion, picking one new weight to add to $F$ at each grafting step until an $\Omega_0$ or $\Omega_1$ stopping condition is reached. In this case, each weight corresponds to one feature.

The simplicity of the linear model means that it sometimes cannot fit the training data very well. But this problem is somewhat reduced when we have many features, since in general the extra features will tend to make the problem more linearly separable.

3.3.2 MULTI-LAYER PERCEPTRON MODEL

For a more powerful model, we might use an MLP with $h$ hidden units having sigmoid transfer functions and a linear output unit with unit output weights. We can write this MLP function as:

$$f(x) = \sum_{j=1}^{h} g\left( \sum_{i=1}^{n} w_i^{(j)} x_i + w_0^{(j)} \right) + w_0^{(0)}$$

In this model $w_i^{(j)}$ is the weight from the $i$-th feature to the $j$-th hidden unit.
The sigmoid transfer function $g(\cdot)$ is defined as:

$$g(x) = \frac{2}{1 + e^{-x}} - 1$$

which is the usual neural net sigmoid function, scaled vertically so that $g(0) = 0$, $g(+\infty) = +1$ and $g(-\infty) = -1$, which makes the net slightly better behaved under grafting. The optimization criterion is almost identical to (10), with the different $f(\cdot)$ substituted. All the weights in the MLP model are included equally in the regularization term, except for the bias weights on each node and the constant weights on the output node, which are excluded.

Rather than use a fixed number of hidden units, the grafting procedure we describe here allows hidden units to be added as the grafting process continues. In the MLP model, each weight added corresponds to a new connection in the network. We begin with $F$ containing just the output bias weight $w_0^{(0)}$ and a network with no hidden units, and perform an initial optimization with respect to just that bias weight. At each grafting step we decide how to grow the network in one of two possible ways:

Adding a new hidden unit: If there are $k - 1$ hidden units already, this involves adding a hidden node, along with a new connection to the output node, and a new connection from the hidden node to a feature input, with weight $w_i^{(k)}$. The question becomes: which feature should be connected to the new hidden unit? The derivative we need is:

$$\frac{\partial C}{\partial w_i^{(k)}} = -\frac{1}{m} \sum_{j=1}^{m} \frac{1}{1 + e^{\rho_j}} \, y_j x_{j,i} \pm \lambda_1$$

Adding an input connection to an existing hidden unit: Each of the $h$ hidden units may be connected to any of the $n$ input features, and any of these connections are candidates for adding to the model (if they are not present already). If we are considering adding a connection from the $i$-th input feature to the $k$-th hidden unit, then the relevant derivative is:

$$\frac{\partial C}{\partial w_i^{(k)}} = -\frac{1}{m} \sum_{j=1}^{m} \frac{1}{1 + e^{\rho_j}} \, y_j x_{j,i} \, g'\left( \sum_{l=1}^{n} w_l^{(k)} x_{j,l} + w_0^{(k)} \right) \pm \lambda_1$$

with:

$$g'(x) = \frac{2 e^{-x}}{(1 + e^{-x})^2}$$

After $t$ grafting steps, there are $n$ possibilities to consider for adding a new hidden unit, and $(hn - t)$ possibilities to consider for adding a connection to an existing hidden unit. Following the usual grafting procedure, we calculate the derivatives for all these candidates and pick the one with the largest magnitude. The corresponding weight is added to $F$ and we reoptimize. The cycle is repeated until one of the stopping conditions is met. If a new hidden unit is added, then we also need to include the associated bias weight, initialized at zero, in $F$.

3.4 Variants

A number of variants of the basic grafting procedure are possible. One interesting alternative is not to attempt a full optimization of the full set $F$ after each grafting step.
Instead, only the most recently added weight and perhaps the bias weights are adjusted. This makes each grafting step faster, at the expense of a loss in accuracy and the strong possibility of ending up at a solution that is not even a local optimum. In practice, if only a small fraction of the possible weights are non-zero when grafting finishes, then the time spent checking gradients to determine the next weight to add often dominates the run time, and so saving a little effort on the optimization makes little difference.

Grafting can also be readily extended to regression problems through the use of a suitable loss function, such as the squared error loss function. This has not yet been implemented.

4. Grafting Experiments

In this section, we compare the performance of grafting and a number of other approaches to feature selection on a set of synthetic and real world test problems. For simplicity, we concentrate entirely on binary classification problems in this paper.

4.1 The Datasets

Five datasets were used in these experiments, labeled A through E. Each dataset consists of a training set and a test set. Datasets A, B and C are synthetic problems, and are all instances of the same basic task described below. Datasets D and E are real world problems, taken from the online UCI Machine Learning Repository (Blake and Merz, 1998).

The three synthetic problems are variations of a task we call the threshold max (TM) problem. In the most basic version of this problem, the feature space contains $n_r$ informative features, each of which is uniformly distributed between $-1$ and $+1$. The output label for a given point is defined as:

$$y = \begin{cases} +1 & \text{if } \max_i(x_i) > 2^{(1 - 1/n_r)} - 1, \quad i = 1 \ldots n_r \\ -1 & \text{otherwise} \end{cases}$$

The $y = -1$ points occupy a large hypercube wedged into one corner of the larger hypercube containing all the points. The $y = +1$ points fill the remaining space. The constant in the above expression is chosen so that half the feature space belongs to each class. Variations of this basic problem are derived by adding $n_i$ irrelevant features uniformly distributed between $-1$ and $+1$, and $n_c$ redundant features which are just copies of the informative features.

The TM problem is designed so that each of the informative features provides only a little information, but by using all of them together, the problem is completely separable. In addition, the optimal discriminating surface for the problem is very non-linear, but the problem is asymmetric, so a linear discriminant should be able to do at least better than random.

Ten instantiations of the training and testing sets of each of the three synthetic problems were generated to obtain some statistics on relative algorithm performance.

In more detail, the datasets are:

Dataset A. The TM problem, with $n_r = 10$, $n_c = 0$ and $n_i = 90$. The training set and the test set contain 1000 points each. This dataset explores the effect of irrelevant features in the TM problem.
Dataset B. The TM problem, with $n_r = 10$, $n_c = 90$ and $n_i = 0$. The training set and the test set contain 1000 points each. This dataset explores the effect of redundant features in the TM problem.

Dataset C. The TM problem, with $n_r = 10$, $n_c = 490$ and $n_i = 500$. The training set contains only 100 training points, despite the problem dimensionality of 1000. The test set contains 1000 points. This dataset explores an extreme situation in which there are many more features than training points.

Dataset D. The Multiple Features database from the UCI repository. This is actually a handwritten digit recognition task, where digitized images of digits have been represented using 649 features of various types. The task tackled here is to distinguish the digit 0 from all other digits. The training and test sets both consist of 1000 points. The features were all scaled to have zero mean and unit variance before being used here.

Dataset E. The Arrhythmia database from the UCI repository. The task here is to distinguish normal from abnormal heartbeat behavior from ECG data described by 279 numeric and binary attributes. The data was slightly modified from the original to make it easier to use. Feature number 14 ("J") was missing in most of the records, so it was removed from all the records. Of the 452 instances in the database, 32 had other missing attribute values and so those instances were also removed, leaving 420 instances described by 278 attributes. These were divided into a training set of 280 points and a test set of 140 points.

All the datasets used in these experiments can be found online at: http://nis-www.lanl.gov/~simes/data/jmlr03/

4.2 The Algorithms

Eight different algorithms, which we denote by the letters (a) through (h), were compared on the five datasets described above. Except where described below, all implementations relied on Matlab (including the Matlab Optimization Toolbox).

(a) The linear grafting algorithm described in Section 3.3, using both $\Omega_0$ and $\Omega_1$ regularization.

(b) The MLP grafting algorithm described in Section 3.3, using both $\Omega_0$ and $\Omega_1$ regularization.

(c) Simple gradient descent to fit an $\Omega_1$ regularized linear combination of all the input features. After gradient descent is complete, any weights with a magnitude less than $10^{-4}$ of the maximum magnitude in the weight vector are pruned in order to obtain the subset selection driven by the $\Omega_1$ regularization.

(d) Simple gradient descent to fit an $\Omega_1$ regularized, fully connected MLP of a similar form to that learned by the MLP grafting algorithm, with the exception that the number of hidden nodes is fixed at 10 (and of course the connectivity is much higher). After gradient descent is complete, any weights with a magnitude less than $10^{-4}$ of the maximum magnitude in the weight vector are pruned in order to obtain the subset selection driven by the $\Omega_1$ regularization.

(e) Linear SVM, which effectively uses $\Omega_2$ regularization and a slightly different loss function from that used by the grafting implementations. The SVM implementation we used is libsvm (Chang and Lin, 2001).

(f) Gaussian RBF kernel SVM, using the default libsvm kernel parameters.

(g) Gaussian RBF kernel SVM as above, but in conjunction with wrapper feature subset selection.
At each feature selection step, all possible features are considered for addition to the current feature set, and 3-fold cross-validation is used to select the best one. This process is repeated until we have selected 10 features, and a final RBF SVM is trained using just those features. This is essentially greedy forward subset selection.

(h) Gaussian RBF kernel SVM, but in conjunction with filter feature subset selection. For each dataset, we simply take the 10 features that are most highly correlated with the label and train our SVM using those features.

Note that an efficient implementation of grafting must directly exploit the sparsity of the models being trained. Our Matlab implementations take care to do this.
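The filter step of algorithm (h) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the experiments: it assumes that "most highly correlated" refers to the absolute value of the Pearson correlation between a feature and the label, it assigns a score of zero to constant features, and the function names `pearson` and `filter_select` are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0 if constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (t - mean_b) for x, t in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((t - mean_b) ** 2 for t in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # assumption: a constant column carries no signal
    return cov / math.sqrt(var_a * var_b)


def filter_select(X: Sequence[Sequence[float]], y: Sequence[float],
                  k: int = 10) -> List[int]:
    """Indices of the k features with the largest |correlation| with y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

Because each feature is scored in isolation, a filter of this kind cannot tell a redundant copy of an informative feature from the original, which is worth keeping in mind when interpreting results on datasets with many duplicated features.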

Dataset   Linear Graft/GD    MLP Graft/GD       Linear SVM   RBF SVM
          (λ1/λ0)            (λ1/λ0)            (C)          (C)
A         3×10⁻⁴/0.005       10⁻⁶/0.005         0.01         10
B         10⁻⁴/0.005         10⁻⁶/0.005         0.01         0.1
C         0.2/0.001          0.1/0.001          0.001        100
D         0.005/0.001        3×10⁻⁴/0.001       1            1
E         0.05/0.001         3×10⁻⁵/0.001       0.01         1

Table 2: Regularization parameters used in experiments for each dataset, chosen using five-fold cross-validation. GD stands for Gradient Descent.

4.3 Experimental Details

Each algorithm was applied to each of the datasets. The exception to this was the fully connected MLP (algorithm (d)), which failed to converge for several of the larger datasets. We suspect that this was caused by the large number of weights close to zero in the model, and the discontinuous derivative of the regularizer at this point. For the three synthetic problems the training runs were repeated for ten different random instantiations of the training and test sets, to assess sensitivity to small changes in the dataset.

The regularization parameters used in these experiments were chosen using five-fold cross-validation on each of the training sets. Algorithms (a) and (c) share the same parameters, as do algorithms (b) and (d), and algorithms (f), (g) and (h). Note that the grafting algorithms, (a) and (b), use both $\Omega_1$ and $\Omega_0$ regularization, requiring parameters $\lambda_1$ and $\lambda_0$, while the corresponding simple gradient descent algorithms, (c) and (d), use only $\Omega_1$ regularization. This is due to the difficulty of incorporating $\Omega_0$ regularization into a standard gradient descent algorithm. The parameters are listed in Table 2.

For the Matlab implementations (algorithms (a), (b), (c) and (d)) the training time on each dataset was recorded to see if grafting gave a speedup over simple gradient descent. A direct speed comparison between the SVM algorithms (written in C) and the Matlab implementations was not attempted at this time, due to the inherent efficiency differences between Matlab and C code.
We also recorded the number of features selected (not the same as the number of weights in general), for those algorithms that select a subset of the features (all algorithms except (e) and (f)). In addition, for the three synthetic problems, we can directly calculate a measure of how useful the selected feature subsets are. Recall that in each of these problems there are $n_r$ informative features, not including redundant duplicates. We can define a feature set saliency measure:

$$s = \frac{n_g}{n_r} + \frac{n_g}{n_f} - 1$$

where $n_g$ is the number of good features selected (i.e. informative features, but not including any duplicate redundant features), and $n_f$ is the total number of features selected. The saliency evaluates to 1 if all $n_r$ good features are selected and no others are selected. It evaluates to $-1$ if no good features are selected, and it evaluates to approximately zero if all the good features are selected, but only along with many irrelevant or redundant features.

Note that the number of selected features for the linear and MLP models trained by simple gradient descent (algorithms (c) and (d)) relies on a somewhat subjective pruning of low valued weights, as described above. Therefore the measurements of numbers of selected features, and