Variable Selection

Variable Selection in Data Mining Project

Gilles Godbout
IFT 6266 - Algorithmes d'Apprentissage
Session Project
Dept. Informatique et Recherche Opérationnelle
Université de Montréal
Montreal, QC, Canada H3C 3J7
godbougi@iro.umontreal.ca

Editor: Yoshua Bengio

Keywords: Data Mining, Variable Selection, $l_1$-norm Regularization, Gradient Directed Regularization

1. Introduction

Bishop [2] establishes that in practical applications of data mining, the choice of preprocessing of the available data will be one of the most significant factors in determining the performance of the final system. Data mining situations are often characterized by the availability of a large number of raw input variables, sometimes in the tens of thousands, and comparatively few training examples. Learning algorithms perform well on domains with a relatively small number of relevant variables. They tend to degrade, however, in the presence of a large number of variables that may include irrelevant and redundant information. Many approaches have been proposed to address the problems of relevance and dimensionality reduction of the input variables. They include algorithms for feature extraction, variable and feature selection, and example selection, to name a few.

This document presents the report on a session project for the course IFT6266 Algorithmes d'Apprentissage. The objectives of the project are described in the next section. Section 3 briefly documents the concepts around variable selection. In section 4, we present a specific data mining problem and propose different approaches to variable selection to address the problem of relevance and dimensionality reduction. We document the results of our experimentation in section 5 and our conclusions in section 6.

This is a preliminary version. At this stage, it is presented as a plan. It contains several elements that are incomplete and require further research and/or discussions.

2. Project Objectives

This project will concentrate on the study of various techniques for variable selection. The objectives of this project are three-fold:
- to familiarize the author with the spectrum of approaches proposed to address the problems of relevance and dimensionality reduction in the current data mining field;
- to compare various selection algorithms in order to solve a specific data mining problem;
- to complement the LISA project PLearn library with one of the studied techniques: the Gradient Directed Regularization for Linear Classification.

3. Variable Selection

We can identify three benefits of variable selection:
- to improve the prediction performance of the predictor;
- to reduce the processing requirements of the predictor;
- to provide a better understanding of the data by identifying the variables most relevant to the problem at hand.

The question is how to identify a subset of the input variables that will lead to the building of a good predictor. Guyon & Elisseeff [5] classify the currently proposed methods into three categories: filters, wrappers and embedded methods.

3.1 Filters

Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. They rank variables according to their predictive power, which can be measured in various ways. Many filter techniques rely on empirical estimates of the mutual information between each variable and the target. Clearly, ranking the input variables by their predictive power does provide some understanding of the data in relation to the problem at hand. However, these techniques have more difficulty identifying combinations of variables with high predictive power. They can also be prone to selecting redundant subsets of variables, hence not addressing the issue of dimensionality as well as one would wish.
For these reasons, we have chosen not to include any experimentation with these techniques in this project.

3.2 Wrappers

Wrappers use the chosen predictor to score subsets of variables according to their predictive power. The main idea here is to train the predictor with many different subsets of the input variables and choose the subset providing the best generalization performance. In most cases, it would be prohibitive to try each possible subset. So the question becomes how to search the space of all possible variable subsets. Some greedy forward
and backward search strategies have been proposed. Their processing requirements do not grow exponentially with the number of variables. Forward selection algorithms start with an empty subset and progressively add the next most promising variable. Conversely, backward selection starts with the set of all variables and progressively eliminates the least promising one. Cross-validation can be used to assess the performance of the predictor with the various subsets and to select the subset with the best generalization power.

In this project, we will experiment with a greedy forward selection algorithm to try to improve the generalization performance of a chosen predictor. The algorithm developed is described in more detail in section 4.3.

3.3 Embedded Methods

Embedded methods perform implicit variable selection in the process of training the predictor itself. One way of achieving this is to add a regularization term to the optimization criterion, which will control the complexity of the model by keeping some of the weights at zero, hence in effect applying the predictor to a subset of the input variables. In this project, we will experiment with this concept using an $l_1$-norm regularization term. The $l_1$-norm regularization approach is described in more detail in section 4.4.

Other approaches to implicit variable selection are intrinsic to the training algorithm itself. We will implement the Gradient Directed Regularization, a gradient descent algorithm in this category. We will use it in the learning of the weight vector and we will emphasize its variable selection capacity. The Gradient Directed Regularization is described in more detail in section 4.5.

4. The Data Mining Problem

This section describes the various components of the experimentation that we want to carry out.

4.1 The Data

This project is centered around a set of available data which presents many of the characteristics typical of data mining problems.
It is a classification task and the target is binary (0 or 1). The inputs are noisy and of very high dimension (1092), thus overfitting is likely to occur (the "curse of dimensionality"). There is a large imbalance between the two classes: only 8.7% of the examples belong to class 1. Finally, the number of examples is relatively small (8176), given the large dimension of the input variables.
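
To make the imbalance concrete, the short Python sketch below works out the class counts implied by the figures quoted in this section, together with the accuracy of a trivial classifier that always predicts class 0. The variable names are illustrative only, and the counts are derived from the rounded 8.7% figure, so they are approximate.

```python
# Figures quoted in section 4.1.
n_examples = 8176
positive_rate = 0.087  # fraction of class 1 examples, as stated (rounded)

# Approximate class counts implied by the stated rate.
n_positive = round(n_examples * positive_rate)
n_negative = n_examples - n_positive

# A classifier that always answers class 0 is right on every negative example.
baseline_accuracy = n_negative / n_examples

print(n_positive, n_negative)       # 711 7465
print(f"{baseline_accuracy:.3f}")   # 0.913
```

An accuracy of about 91% can therefore be reached without learning anything from the inputs.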

4.2 The Predictor

Because of the imbalance between the two classes, the classification accuracy will not be a good measure of performance. We propose instead to use logistic regression to build an estimator of $P(Y \mid X)$. In logistic regression, the probability of an example $x_i$ being drawn from the positive class is estimated by:

$$P(Y = +1 \mid X = x_i) = \mathrm{sigmoid}\left(\sum_{k=0}^{d} w_k x_{ik}\right)$$

where $\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$ and $1 - \mathrm{sigmoid}(z) = \mathrm{sigmoid}(-z)$. Here $d$ denotes the number of input variables. Note that we have added to each example a feature $x_{i0} = 1$, so that $w_0$ acts as a bias parameter. We will also pre-process the available data to replace the class label 0 with -1. Then we can write:

$$P(Y = -1 \mid X = x_i) = 1 - P(Y = +1 \mid X = x_i) = 1 - \mathrm{sigmoid}\left(\sum_{k=0}^{d} w_k x_{ik}\right) = \mathrm{sigmoid}\left(-\sum_{k=0}^{d} w_k x_{ik}\right)$$

Therefore we can combine the two formulas to facilitate the implementation:

$$P(Y = y \mid X = x_i) = \mathrm{sigmoid}\left(y \sum_{k=0}^{d} w_k x_{ik}\right)$$

We will learn the parameters so as to maximize the likelihood of the training data. Hence, we will use the negative log-likelihood as the performance criterion to minimize:

$$L(w) = -\sum_i \log\left(\mathrm{sigmoid}\left(y_i \sum_{k=0}^{d} w_k x_{ik}\right)\right)$$

The weight vector $w$ will be learned through gradient descent. Note that $\frac{d\,\mathrm{sigmoid}(z)}{dz} = \mathrm{sigmoid}(z)\,\mathrm{sigmoid}(-z)$. We can compute the gradient of the negative log-likelihood with respect to each $w_j$ with the following formula:

$$\frac{\partial L(w)}{\partial w_j} = -\sum_i y_i x_{ij}\,\mathrm{sigmoid}\left(-y_i \sum_{k=0}^{d} w_k x_{ik}\right)$$

Under a scenario with no regularization, the parameter update formula will be:

$$w_j^{t} = w_j^{t-1} - \epsilon \frac{\partial L(w)}{\partial w_j}$$

We will stop the learning when the sum of the absolute values of the weight differences is less than a threshold (the weight differences equal the gradient components below scaled by the learning rate $\epsilon$):

$$\sum_{j=0}^{d} \left|\frac{\partial L(w)}{\partial w_j}\right| < \gamma$$

The parameter $\epsilon$ is the learning rate and we call the parameter $\gamma$ the early stopping parameter. Under this model both the learning rate and the early stopping parameter are hyper-parameters. They will be learned with cross-validation.

Some questions to consider: Should we use stochastic gradient descent to accelerate learning, or does the imbalance preclude it?
Should we use a momentum term to progressively reduce the learning rate? The two questions are related.
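
The predictor of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PLearn implementation planned for the project. It assumes NumPy, batch gradient descent started from zero weights, and a tiny made-up data set. The function names and the default values of `learning_rate`, `gamma` and `max_steps` are hypothetical. The stopping test is applied to the sum of absolute gradient components, which differs from the sum of absolute weight differences only by the learning rate factor.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def negative_log_likelihood(w, X, y):
    """L(w) = -sum_i log sigmoid(y_i * <w, x_i>), with labels y_i in {-1, +1}."""
    return -np.sum(np.log(sigmoid(y * (X @ w))))


def gradient(w, X, y):
    """dL/dw_j = -sum_i y_i * x_ij * sigmoid(-y_i * <w, x_i>)."""
    return -(X.T @ (y * sigmoid(-y * (X @ w))))


def fit_logistic(X, y, learning_rate=0.1, gamma=1e-3, max_steps=10_000):
    """Batch gradient descent; zero initialization and max_steps are assumptions."""
    w = np.zeros(X.shape[1])
    for _ in range(max_steps):
        g = gradient(w, X, y)
        if np.sum(np.abs(g)) < gamma:  # early stopping threshold gamma
            break
        w = w - learning_rate * g      # update with learning rate epsilon
    return w


# Made-up, non-separable toy data: column 0 is the constant feature x_i0 = 1.
X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0],
              [1.0, -2.0], [1.0, 0.5], [1.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0, 1.0])  # labels recoded from {0, 1} to {-1, +1}

w = fit_logistic(X, y)
print(w, negative_log_likelihood(w, X, y))
```

Because the labels are coded as -1 and +1, a single expression covers both classes in the likelihood and in its gradient, which is the simplification the combined formula is meant to provide.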

4.3 The Greedy Forward Selection Algorithm

This forward selection algorithm starts with an empty subset and progressively adds the next most promising variable. It then uses cross-validation to assess the performance of the predictor with the various subsets and to select the subset with the best generalization power. The pseudo-code of the algorithm is as follows:

    010 we define d as the dimension of the input
    020 we define A as our predictor
    030 we define {X_i} as the subset of the input containing dimension i
    040 initialize subset B_0 = {} (an empty subset)
    050 for d iterations with i = 1 to d do
    060     for all {X_j} not yet selected in B_{i-1} do
    070         train A with B_{i-1} ∪ {X_j}
    080     select C_i as the {X_j} that gave the best value of the performance criterion of
    090     A trained with B_{i-1} ∪ {X_j}
    100     set B_i = B_{i-1} ∪ C_i
    110 use cross-validation to select between all B_i

Note that on line 080, we can select between the {X_j} directly from the training results since we are choosing between models of the same complexity.

4.4 The $l_1$-norm Regularization

The $l_1$-norm of a vector $v$ is defined by:

$$\|v\|_1 = \sum_i |v_i|$$

Tibshirani [11] has demonstrated that introducing a constraint of the form $\|v\|_1 \le t$ on a parameter vector $v$ shrinks some coefficients and sets others to zero, and hence will achieve implicit variable subset selection if we apply this constraint to the non-bias components of our weight vector. This is equivalent to rewriting our performance criterion to be minimized as:

$$L_{l_1}(w, \lambda) = -\sum_i \log\left(\mathrm{sigmoid}\left(y_i \sum_{k=0}^{d} w_k x_{ik}\right)\right) + \lambda \sum_{k=1}^{d} |w_k|$$

To overcome the difficulty of $|w_k|$ being a non-differentiable function, we will transform $w$ into $w^+ - w^-$ and add the $2d$ additional constraints:

$$w_k^+ \ge 0 \quad \text{and} \quad w_k^- \ge 0 \quad \text{for } k \in \{1, \dots, d\}$$

We will fix $w_0^- = 0$ and let $w_0^+$ move freely as our bias parameter, since we do not need two.
Then our criterion to minimize becomes:

$$L_{l_1}(w, \lambda) = -\sum_i \log\left(\mathrm{sigmoid}\left(y_i \sum_{k=0}^{d} (w_k^+ - w_k^-)\, x_{ik}\right)\right) + \lambda \sum_{k=1}^{d} (w_k^+ + w_k^-)$$

When $j = 0$, the formulas to compute the gradient and update the parameter $w_0^+$ remain the same as in section 4.2. For $j > 0$, the formulas to compute the gradient of our criterion
with respect to each $w_j^s$, $s \in \{+, -\}$, become:

$$\frac{\partial L_{l_1}(w, \lambda)}{\partial w_j^+} = -\sum_i y_i x_{ij}\,\mathrm{sigmoid}\left(-y_i \sum_{k=0}^{d} (w_k^+ - w_k^-)\, x_{ik}\right) + \lambda$$

$$\frac{\partial L_{l_1}(w, \lambda)}{\partial w_j^-} = \sum_i y_i x_{ij}\,\mathrm{sigmoid}\left(-y_i \sum_{k=0}^{d} (w_k^+ - w_k^-)\, x_{ik}\right) + \lambda$$

Under this scenario with regularization, the parameter update formulas for $j > 0$ will be modified as follows to maintain the additional constraints:

$$w_j^{s(t)} = \begin{cases} w_j^{s(t-1)} - \epsilon \dfrac{\partial L_{l_1}(w, \lambda)}{\partial w_j^s} & \text{if the result is greater than 0} \\ 0 & \text{otherwise} \end{cases}$$

The hyper-parameter $\lambda$ is the regularization parameter, which will be learned by cross-validation together with the learning rate and the early stopping parameter.

We need to verify that this is the proper way of maintaining the positivity constraint on all the parameters.

How does this achieve implicit variable subset selection? We can easily see that choosing a zero regularization parameter $\lambda$ reverts to the same algorithm as the one described in section 4.2, with no subset selection. As we increase $\lambda$, we progressively force the weights of the least promising variables to zero, hence moving progressively towards an empty subset.

4.5 The Gradient Directed Regularization

Instead of using a regularization term to do implicit subset selection as in the previous section, this approach uses the parameter update algorithm of the gradient descent as the means of regularization. In this algorithm, we start with an empty subset of selected variables and we set all the variable weights to zero. Then, at each update step of the gradient descent, we first identify which variable weight shows the largest gradient in absolute value. If that variable has not yet been selected, we include it in the subset of selected variables. After that, we only update the weights of the variables in the subset of selected variables, using the same formula as in section 4.2.

When we compare this to the approach in the previous section, this is equivalent to starting with a very large value for the parameter $\lambda$ and moving progressively towards a situation where $\lambda = 0$.
The advantage is obviously to do away with having to learn an additional hyper-parameter.

Have we understood this properly? We have not found references for this approach. The closest are Efron et al. [3] (Least Angle Regression) and Hastie et al. [6] (Forward Stagewise Linear Regression). Friedman & Popescu [4] allude to it in reference to the previous two. It makes sense to us but we need to confirm before spending more time on it.
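
The update rules of sections 4.4 and 4.5 can be sketched side by side in Python. This is a hedged illustration that follows the description given in this report, not necessarily the algorithm of Friedman & Popescu [4]. It assumes NumPy and labels coded as -1 and +1. The names `l1_step` and `gdr_fit` and the toy data are hypothetical. Two further assumptions are made because the text leaves them open: the bias weight is always updated, and a fixed number of steps `n_steps` stands in for the early stopping criterion.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def nll_gradient(w, X, y):
    """Gradient of the negative log-likelihood of section 4.2 (labels in {-1, +1})."""
    return -(X.T @ (y * sigmoid(-y * (X @ w))))


def l1_step(w_pos, w_neg, X, y, learning_rate, lam):
    """One clipped update of section 4.4, with w = w_pos - w_neg and column 0 as bias."""
    g = nll_gradient(w_pos - w_neg, X, y)
    new_pos = w_pos - learning_rate * (g + lam)
    new_neg = w_neg - learning_rate * (-g + lam)
    # Non-bias components: reset to 0 when the update would make them negative.
    new_pos[1:] = np.maximum(new_pos[1:], 0.0)
    new_neg[1:] = np.maximum(new_neg[1:], 0.0)
    # Bias: unpenalized and unconstrained, carried by w_pos[0] alone.
    new_pos[0] = w_pos[0] - learning_rate * g[0]
    new_neg[0] = 0.0
    return new_pos, new_neg


def gdr_fit(X, y, learning_rate=0.05, n_steps=20):
    """Gradient directed regularization as described in section 4.5."""
    w = np.zeros(X.shape[1])
    selected = np.zeros(X.shape[1], dtype=bool)
    selected[0] = True  # assumption: the bias is always part of the update
    for _ in range(n_steps):
        g = nll_gradient(w, X, y)
        j = 1 + int(np.argmax(np.abs(g[1:])))  # variable with the largest |gradient|
        selected[j] = True                     # no effect if already selected
        w[selected] -= learning_rate * g[selected]
    return w, selected


# Made-up toy data: column 0 is the bias, column 1 is informative, column 2 is not.
X = np.array([[1.0, 2.0, 0.1], [1.0, 1.5, -0.1],
              [1.0, -2.0, 0.1], [1.0, -1.5, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, selected = gdr_fit(X, y)
print(selected, w)
```

In this sketch the number of gradient steps plays the role that $\lambda$ plays in `l1_step`: few steps leave most weights at exactly zero, and more steps let additional variables enter the selected subset.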

5. Experimentation Results

This section will be developed after the previous section is completely finalized. It will present and compare the results of:
- training the chosen predictor without any regularization;
- training the predictor using greedy forward selection to choose a subset of variables;
- training the predictor with an $l_1$-norm regularization term as described in section 4.4;
- training the predictor using the gradient directed regularization algorithm as described in section 4.5.

One of the main outstanding tasks is to map out the partitioning of the available data in order to do all the training, validation and comparison implied in this section.

6. Conclusions

This section will be developed after the results of the experimentation are known. It should establish whether we have indeed been able to improve the generalization performance of our chosen predictor by limiting the complexity of the model through variable selection. It will also identify additional work worth pursuing on the subject of pre-processing for the purpose of relevance and dimensionality reduction.

References

[1] Yoshua Bengio & Nicolas Chapados: Extensions to Metric Based Models, Journal of Machine Learning Research 3 (2003), 1209-1227
[2] Christopher M. Bishop: Chapter 8: Pre-Processing and Feature Extraction, Neural Networks for Pattern Recognition, Oxford University Press (1995), 295-331
[3] Bradley Efron, Trevor Hastie, Iain Johnstone & Robert Tibshirani: Least Angle Regression, Web-Published Writings (2003), 1-44
[4] Jerome H. Friedman & Bogdan E. Popescu: Gradient Directed Regularization for Linear Regression and Classification, Web-Published Writings (2004), 1-40
[5] Isabelle Guyon & André Elisseeff: An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3 (2003), 1157-1182
[6] Trevor Hastie, Robert Tibshirani & Jerome Friedman: Section 10.12 Regularization, The Elements of Statistical Learning, Springer (2003), 324-331
[7] Simon Latendresse & Yoshua Bengio: Linear Regression and the Optimization of Hyper-Parameters, Web-Published Writings, 1-7
[8] Baranidharan Raman & Thomas R. Ioerger: Enhancing Learning using Feature and Example Selection, Journal of Machine Learning Research 3 (2003), 1-37
[9] Jason Rennie: Logistic Regression, Web-Published Writings (2003), 1-3
[10] Saharon Rosset & Ji Zhu: Piecewise Linear Regularized Solution Paths, Submission for a Workshop at NIPS (2003), 1-20
[11] Robert Tibshirani: Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, Volume 58, No. 1 (1996), 267-288