Variable Selection in Data Mining Project


Gilles Godbout
IFT 6266 - Algorithmes d'Apprentissage, Session Project
Dept. Informatique et Recherche Opérationnelle
Université de Montréal
Montreal, QC, Canada H3C 3J7
godbougi@iro.umontreal.ca

Editor: Yoshua Bengio

Keywords: Data Mining, Variable Selection, l1-norm Regularization, Gradient Directed Regularization

1. Introduction

Bishop (2) establishes that in practical applications of data mining, the choice of pre-processing of the available data is one of the most significant factors in determining the performance of the final system. Data mining problems are often characterized by the availability of a large number of raw input variables, sometimes in the tens of thousands, and comparatively few training examples. Learning algorithms perform well on domains with a relatively small number of relevant variables. They tend to degrade, however, in the presence of a large number of variables that may include irrelevant and redundant information. Many approaches have been proposed to address the problems of relevance and of reducing the dimensionality of the input space. They include algorithms for feature extraction, variable and feature selection, and example selection, to name a few.

This document presents the report on a session project for the course IFT6266 Algorithmes d'Apprentissage. The objectives of the project are described in the next section. Section 3 briefly reviews the concepts around variable selection. In Section 4, we present a specific data mining problem and propose different variable selection approaches to address the problems of relevance and dimensionality reduction. We document the results of our experimentation in Section 5 and our conclusions in Section 6.

This is a preliminary version. At this stage, it is presented as a plan. It contains several elements that are incomplete and require further research and/or discussion.

2. Project Objectives

This project concentrates on the study of various techniques for variable selection. The objectives of this project are three-fold:

- to familiarize the author with the spectrum of approaches proposed to address the problems of relevance and dimensionality reduction in the current data mining field;
- to compare various selection algorithms on a specific data mining problem;
- to complement the LISA project PLearn library with one of the studied techniques: Gradient Directed Regularization for Linear Classification.

3. Variable Selection

We can identify three benefits of variable selection:

- to improve the prediction performance of the predictor;
- to reduce the processing requirements of the predictor;
- to provide a better understanding of the data by identifying the variables most relevant to the problem at hand.

The question is how to identify a subset of the input variables that will lead to a good predictor. Guyon & Elisseeff (5) classify the currently proposed methods into three categories: filters, wrappers and embedded methods.

3.1 Filters

Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. They rank variables according to their predictive power, which can be measured in various ways. Many filter techniques rely on empirical estimates of the mutual information between each variable and the target. Ranking the input variables by their predictive power does provide some understanding of the data in relation to the problem at hand. However, these techniques have more difficulty identifying combinations of variables with high predictive power. They can also be prone to selecting redundant subsets of variables, hence not addressing the issue of dimensionality as well as one would wish. For these reasons, we have chosen not to include any experimentation with filter techniques in this project.

3.2 Wrappers

Wrappers use the chosen predictor to score subsets of variables according to their predictive power. The main idea is to train the predictor with many different subsets of the input variables and choose the subset providing the best generalization performance. In most cases, it would be prohibitive to try every possible subset, so the question becomes how to search the space of all possible variable subsets.

Some greedy forward and backward search strategies have been proposed; their processing requirements do not grow exponentially with the number of variables. Forward selection starts with an empty subset and progressively adds the next most promising variable. Conversely, backward selection starts with the set of all variables and progressively eliminates the least promising one. Cross-validation can be used to assess the performance of the predictor with the various subsets and to select the subset with the best generalization power. In this project, we experiment with a greedy forward selection algorithm to try to improve the generalization performance of a chosen predictor. The algorithm developed is described in more detail in Section 4.3.

3.3 Embedded Methods

Embedded methods perform implicit variable selection in the process of training the predictor itself. One way of achieving this is to add a regularization term to the optimization criterion, which controls the complexity of the model by keeping some of the weights at zero, in effect applying the predictor to a subset of the input variables. In this project, we experiment with this concept using an l1-norm regularization term. The l1-norm regularization approach is described in more detail in Section 4.4.

Other approaches to implicit variable selection are intrinsic to the training algorithm itself. We will implement Gradient Directed Regularization, a gradient descent algorithm in this category. We will use it to learn the weight vector and will emphasize its variable selection capacity. Gradient Directed Regularization is described in more detail in Section 4.5.

4. The Data Mining Problem

This section describes the various components of the experimentation that we want to carry out.

4.1 The Data

This project is centered around a set of available data which presents many of the characteristics typical of data mining problems. It is a classification task and the target is binary (0 or 1). The inputs are noisy and of very high dimension (1092), so overfitting is likely to occur ("curse of dimensionality"). There is a large imbalance between the two classes: only 8.7% of the examples belong to class 1. Finally, the number of examples is relatively small (8176) given the large dimension of the input.

4.2 The Predictor

Because of the imbalance between the two classes, classification accuracy will not be a good measure of performance. We propose instead to use logistic regression to build an estimator of $P(Y \mid X)$. In logistic regression, the probability of an example $x_i$ being drawn from the positive class is estimated by

$$P(Y = +1 \mid X = x_i) = \mathrm{sigmoid}\Big(\sum_k w_k x_{ik}\Big), \quad \text{where } \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}} \text{ and } 1 - \mathrm{sigmoid}(z) = \mathrm{sigmoid}(-z).$$

Note that we have added to each example a feature $x_{i0} = 1$, so that $w_0$ acts as a bias parameter. We will also pre-process the available data to change the class label 0 into -1. Then we can write

$$P(Y = -1 \mid X = x_i) = 1 - P(Y = +1 \mid X = x_i) = 1 - \mathrm{sigmoid}\Big(\sum_k w_k x_{ik}\Big) = \mathrm{sigmoid}\Big(-\sum_k w_k x_{ik}\Big).$$

Therefore we can combine the two formulas to facilitate the implementation:

$$P(Y = y \mid X = x_i) = \mathrm{sigmoid}\Big(y \sum_k w_k x_{ik}\Big).$$

We will learn the parameters in order to maximize the likelihood of the training data. Hence, we will use the negative log-likelihood as the performance criterion to minimize:

$$L(w) = -\sum_i \log\Big(\mathrm{sigmoid}\Big(y_i \sum_k w_k x_{ik}\Big)\Big).$$

The weight vector $w$ will be learned through gradient descent. Note that $\frac{d\,\mathrm{sigmoid}(z)}{dz} = \mathrm{sigmoid}(z)\,\mathrm{sigmoid}(-z)$. We can therefore compute the gradient of the negative log-likelihood with respect to each $w_j$ with the following formula:

$$\frac{\partial L(w)}{\partial w_j} = -\sum_i y_i x_{ij}\, \mathrm{sigmoid}\Big(-y_i \sum_k w_k x_{ik}\Big).$$

Under a scenario with no regularization, the parameter update formula will be

$$w_j^{t} = w_j^{t-1} - \epsilon\, \frac{\partial L(w)}{\partial w_j}.$$

We will stop the learning when the sum of the absolute values of the weight differences is less than a threshold:

$$\sum_{j=0}^{d} \Big| \epsilon\, \frac{\partial L(w)}{\partial w_j} \Big| \leq \gamma.$$

The parameter $\epsilon$ is the learning rate and we call the parameter $\gamma$ the early stopping threshold. Under this model both the learning rate and the early stopping threshold are hyper-parameters; they will be learned with cross-validation.

Some questions to consider: Should we use stochastic gradient descent to accelerate learning, or does the class imbalance preclude it? Should we use momentum to progressively reduce the learning rate? The two questions are related.
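To make the procedure of Section 4.2 concrete, here is a minimal sketch of the unregularized training loop in Python/NumPy. The helper names (`train_logistic`, `eps`, `gamma`, `max_iter`) and the batch formulation are illustrative assumptions, not part of the PLearn implementation planned for the project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eps=0.01, gamma=1e-4, max_iter=10000):
    """Batch gradient descent on the negative log-likelihood of Section 4.2.

    X : (n, d) array of inputs; a bias column of ones is prepended here.
    y : (n,) array of labels in {-1, +1}.
    eps is the learning rate, gamma the early stopping threshold on the
    sum of absolute weight changes between two updates.
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])    # x_{i0} = 1 carries the bias w_0
    w = np.zeros(d + 1)
    for _ in range(max_iter):
        margins = y * (Xb @ w)
        # dL/dw_j = - sum_i y_i x_{ij} sigmoid(-y_i sum_k w_k x_{ik})
        grad = -(Xb.T @ (y * sigmoid(-margins)))
        step = eps * grad
        w -= step
        if np.sum(np.abs(step)) <= gamma:   # stopping criterion of Section 4.2
            break
    return w

def predict_proba(X, w):
    """Estimated P(Y = +1 | X) for each row of X."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)
```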

4.3 The Greedy Forward Selection Algorithm

This forward selection algorithm starts with an empty subset and progressively adds the next most promising variable. It then uses cross-validation to assess the performance of the predictor with the various subsets and to select the subset with the best generalization power. The pseudo-code of the algorithm is as follows:

010 define d as the dimension of the input
020 define A as our predictor
030 define {X_i} as the subset of the input containing dimension i
040 initialize subset B_0 = {} (an empty subset)
050 for d iterations with i = 1 to d do
060     for all {X_j} not yet selected in B_{i-1} do
070         train A with B_{i-1} ∪ {X_j}
080     select C_i as the {X_j} that maximized the performance criterion of
090         A trained with B_{i-1} ∪ {X_j}
100     set B_i = B_{i-1} ∪ C_i
110 use cross-validation to select among all B_i

Note that on line 080 we can select among the {X_j} directly from the training results, since we are choosing between models of the same complexity. A sketch of this procedure appears below.
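A minimal sketch of this wrapper is given below, reusing the `sigmoid` and `train_logistic` helpers from the previous sketch. For brevity it scores the final choice among the nested subsets on a single held-out validation split rather than full cross-validation; the function name and the use of the negative log-likelihood as the scoring criterion are illustrative assumptions.

```python
import numpy as np

def neg_log_likelihood(X, y, w):
    """Criterion of Section 4.2, evaluated on (X, y)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return -np.sum(np.log(sigmoid(y * (Xb @ w))))

def greedy_forward_selection(X_train, y_train, X_val, y_val, max_vars=None):
    """Greedy forward selection of Section 4.3 (validation split instead of
    full cross-validation, for brevity)."""
    d = X_train.shape[1]
    max_vars = d if max_vars is None else max_vars
    selected = []                           # B_i, as a list of column indices
    best_subset, best_val = [], np.inf
    for _ in range(max_vars):
        # inner loop: pick C_i from the training criterion (line 080 of the pseudo-code)
        candidates = [j for j in range(d) if j not in selected]
        scores = []
        for j in candidates:
            cols = selected + [j]
            w = train_logistic(X_train[:, cols], y_train)
            scores.append(neg_log_likelihood(X_train[:, cols], y_train, w))
        selected.append(candidates[int(np.argmin(scores))])
        # outer choice: keep the B_i with the best validation criterion
        w = train_logistic(X_train[:, selected], y_train)
        val = neg_log_likelihood(X_val[:, selected], y_val, w)
        if val < best_val:
            best_subset, best_val = list(selected), val
    return best_subset
```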

4.4 The l1-norm Regularization

The l1-norm of a vector $v$ is defined by

$$\|v\|_1 = \sum_i |v_i|.$$

Tibshirani (11) has demonstrated that introducing a constraint of the form $\|v\|_1 \leq t$ on a parameter vector $v$ shrinks some coefficients and sets others to zero, and hence achieves implicit variable subset selection if we apply this constraint to the non-bias components of our weight vector. This is equivalent to rewriting our performance criterion to minimize as

$$L_{\ell_1}(w, \lambda) = -\sum_i \log\Big(\mathrm{sigmoid}\Big(y_i \sum_k w_k x_{ik}\Big)\Big) + \lambda \sum_{k=1}^{d} |w_k|.$$

To overcome the difficulty of $|w_k|$ being non-differentiable, we transform $w$ into $w^+ - w^-$ and add the $2d$ additional constraints

$$w_k^+ \geq 0 \quad \text{and} \quad w_k^- \geq 0 \quad \text{for } k \in \{1, \dots, d\}.$$

We fix $w_0^- = 0$ and let $w_0^+$ move freely as our bias parameter, since we do not need two. Then our criterion to minimize becomes

$$L_{\ell_1}(w, \lambda) = -\sum_i \log\Big(\mathrm{sigmoid}\Big(y_i \sum_{k} (w_k^+ - w_k^-) x_{ik}\Big)\Big) + \lambda \sum_{k=1}^{d} (w_k^+ + w_k^-).$$

When $j = 0$, the formulas to compute the gradient and update the parameter $w_0^+$ remain the same as in Section 4.2. For $j > 0$, the gradients of our criterion with respect to $w_j^+$ and $w_j^-$ become

$$\frac{\partial L_{\ell_1}(w, \lambda)}{\partial w_j^+} = -\sum_i y_i x_{ij}\, \mathrm{sigmoid}\Big(-y_i \sum_k (w_k^+ - w_k^-) x_{ik}\Big) + \lambda,$$

$$\frac{\partial L_{\ell_1}(w, \lambda)}{\partial w_j^-} = +\sum_i y_i x_{ij}\, \mathrm{sigmoid}\Big(-y_i \sum_k (w_k^+ - w_k^-) x_{ik}\Big) + \lambda.$$

Under this scenario with regularization, the parameter update formulas for $j > 0$ are modified as follows to maintain the additional constraints:

$$w_j^{\pm(t)} = \begin{cases} w_j^{\pm(t-1)} - \epsilon\, \dfrac{\partial L_{\ell_1}(w, \lambda)}{\partial w_j^{\pm}} & \text{if the result is greater than } 0, \\ 0 & \text{otherwise.} \end{cases}$$

The hyper-parameter $\lambda$ is the regularization parameter, which will be learned with cross-validation together with the learning rate and the early stopping threshold. We need to verify that this is the proper way of maintaining the positivity constraint on all the parameters. A sketch of this clipped update appears after Section 4.5.

How does this achieve implicit variable subset selection? We can easily see that choosing a zero regularization parameter $\lambda$ reverts to the same algorithm as the one described in Section 4.2, with no subset selection. As we grow $\lambda$, we progressively force the least promising variable weights to zero, hence moving progressively towards an empty subset.

4.5 The Gradient Directed Regularization

Instead of using a regularization term to do implicit subset selection as in the previous section, this approach uses the parameter update algorithm of the gradient descent itself as the regularization mechanism. In this algorithm, we start with an empty subset of selected variables and set all the variable weights to zero. Then, at each update step of the gradient descent, we first identify which variable weight shows the largest gradient in absolute value. If that variable has not yet been selected, we include it in the subset of selected variables. After that, we only update the weights of the variables in the subset of selected variables, using the same formula as in Section 4.2. A sketch follows below.

Compared to the approach in the previous section, this is equivalent to starting with a very large value for the parameter $\lambda$ and moving progressively towards a situation where $\lambda = 0$. The advantage is obviously to do away with having to learn an additional hyper-parameter.

Have we understood this properly? We have not found references for this approach. The closest are Efron et al. (3), Least Angle Regression, and Hastie et al. (6), Forward Stagewise Linear Regression. Friedman et al. (4) allude to it in reference to the previous two. It makes sense to us, but we need to confirm before spending more time on it.
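Returning to Section 4.4, the following is a minimal sketch of the clipped update on the $(w^+, w^-)$ formulation, reusing the `sigmoid` helper from the first sketch. The treatment of the bias (penalty omitted, $w_0^-$ pinned to zero) follows the text above; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def train_logistic_l1(X, y, lam, eps=0.01, gamma=1e-4, max_iter=10000):
    """Sketch of the w = w+ - w- formulation of Section 4.4.

    X : (n, d) inputs (a bias column is added here), y : labels in {-1, +1},
    lam : the regularization parameter lambda. w_minus[0] is pinned to 0 so
    that w_plus[0] alone plays the role of the bias.
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])
    w_plus = np.zeros(d + 1)
    w_minus = np.zeros(d + 1)
    for _ in range(max_iter):
        w = w_plus - w_minus
        margins = y * (Xb @ w)
        data_grad = -(Xb.T @ (y * sigmoid(-margins)))      # same as Section 4.2
        grad_plus = data_grad + lam
        grad_minus = -data_grad + lam
        grad_plus[0] = data_grad[0]                        # bias is not penalized
        new_plus = w_plus - eps * grad_plus
        new_minus = w_minus - eps * grad_minus
        # clipped update of Section 4.4: keep the step only if the result stays positive
        new_plus[1:] = np.maximum(new_plus[1:], 0.0)
        new_minus[1:] = np.maximum(new_minus[1:], 0.0)
        new_minus[0] = 0.0                                 # w-_0 stays fixed at 0
        delta = np.sum(np.abs(new_plus - w_plus)) + np.sum(np.abs(new_minus - w_minus))
        w_plus, w_minus = new_plus, new_minus
        if delta <= gamma:                                 # stopping as in Section 4.2
            break
    return w_plus - w_minus
```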

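For Section 4.5, the sketch below reflects our reading of the gradient-directed scheme, which is close in spirit to incremental forward stagewise fitting; as noted above, whether it matches Friedman & Popescu's Gradient Directed Regularization exactly still needs to be confirmed. Treating the bias as always active and using a fixed number of steps are assumptions made for illustration; the `sigmoid` helper from the first sketch is reused.

```python
import numpy as np

def train_gradient_directed(X, y, eps=0.01, n_steps=5000):
    """Sketch of the gradient-directed scheme described in Section 4.5.

    At each step, the variable whose weight has the largest gradient in
    absolute value is added to the active subset (if not already in it),
    and only the weights of active variables (plus the bias) are updated.
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])
    w = np.zeros(d + 1)
    active = {0}                                   # assumption: bias always active
    for _ in range(n_steps):
        margins = y * (Xb @ w)
        grad = -(Xb.T @ (y * sigmoid(-margins)))   # gradient of Section 4.2
        j = 1 + int(np.argmax(np.abs(grad[1:])))   # most promising non-bias variable
        active.add(j)
        mask = np.zeros(d + 1, dtype=bool)
        mask[list(active)] = True
        w[mask] -= eps * grad[mask]                # update active weights only
    return w, sorted(active - {0})
```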
5. Experimentation Results

This section will be developed once the previous section is completely finalized. It will present and compare the results of:

- training the chosen predictor without any regularization;
- training the predictor using greedy forward selection to choose a subset of variables;
- training the predictor with an l1-norm regularization term as described in Section 4.4;
- training the predictor using the gradient directed regularization algorithm as described in Section 4.5.

One of the big to-dos is to map out the partitioning of the available data in order to do all the training, validation and comparison implied in this section.

6. Conclusions

This section will be developed once the results of the experimentation are known. It should establish whether we have indeed been able to improve the generalization performance of our chosen predictor by limiting the complexity of the model through variable selection. It will also identify additional work worth pursuing on the subject of pre-processing for the purpose of relevance and dimensionality reduction.

References

[1] Yoshua Bengio & Nicolas Chapados: Extensions to Metric-Based Models, Journal of Machine Learning Research 3 (2003), 1209-1227.

[2] Christopher M. Bishop: Chapter 8: Pre-Processing and Feature Extraction, Neural Networks for Pattern Recognition, Oxford University Press (1995), 295-331.

[3] Bradley Efron, Trevor Hastie, Iain Johnstone & Robert Tibshirani: Least Angle Regression, web-published manuscript (2003), 1-44.

[4] Jerome H. Friedman & Bogdan E. Popescu: Gradient Directed Regularization for Linear Regression and Classification, web-published manuscript (2004), 1-40.

[5] Isabelle Guyon & André Elisseeff: An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3 (2003), 1157-1182.

[6] Trevor Hastie, Robert Tibshirani & Jerome Friedman: Section 10.12 Regularization, The Elements of Statistical Learning, Springer (2003), 324-331.

[7] Simon Latendresse & Yoshua Bengio: Linear Regression and the Optimization of Hyper-Parameters, web-published manuscript, 1-7.

[8] Baranidharan Raman & Thomas R. Ioerger: Enhancing Learning using Feature and Example Selection, Journal of Machine Learning Research 3 (2003), 1-37.

[9] Jason Rennie: Logistic Regression, web-published manuscript (2003), 1-3.

[10] Saharon Rosset & Ji Zhu: Piecewise Linear Regularized Solution Paths, submission for a NIPS workshop (2003), 1-20.

[11] Robert Tibshirani: Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, Vol. 58, No. 1 (1996), 267-288.