Iterative Laplacian Score for Feature Selection


Linling Zhu, Linsong Miao, and Daoqiang Zhang
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
{zhulinling,linsongmiao,dqzhang}@nuaa.edu.cn

Abstract. Laplacian Score (LS) is a popular feature-ranking-based feature selection method that can be used in both supervised and unsupervised settings. In this paper, we propose an improved LS method called Iterative Laplacian Score (ILS), which iteratively updates the nearest neighbor graph used to evaluate the importance of a feature by its locality preserving ability. Compared with LS, the key idea of ILS is to gradually improve the nearest neighbor graph by discarding the least relevant features at each iteration. Experimental results on several high-dimensional data sets demonstrate the effectiveness of the proposed method.

Keywords: feature selection, Iterative Laplacian Score, Laplacian Score, locality preserving.

1 Introduction

Feature selection has become an important preprocessing step in machine learning and data mining due to the rapid accumulation of high-dimensional data such as gene expression microarrays, digital images, and text data. Feature selection is effective in reducing dimensionality and removing irrelevant and redundant features. According to [1], feature selection methods can be broadly classified into wrapper methods and filter methods. Wrapper methods need a predetermined learning algorithm to evaluate the features, while filter methods are entirely independent of any learning algorithm: they evaluate features by examining intrinsic properties of the data. Although wrapper methods may usually outperform filter methods in terms of accuracy, the latter are more efficient in computation [8]. When confronted with a huge number of features, filter methods are therefore more attractive, and in this paper we are particularly interested in filter feature selection methods. Data variance, Pearson correlation coefficients, Fisher score, and the Kolmogorov-Smirnov test are all typical filter methods.

From another perspective, feature selection methods can be divided into supervised and unsupervised methods [2, 3]. Supervised methods evaluate feature importance by the correlation between a feature and the class label; Pearson correlation coefficients, Fisher score, and the Kolmogorov-Smirnov test are supervised methods. Unsupervised methods evaluate the importance of a feature by its ability to preserve certain properties of the data, such as variance or locality [4, 5]. Data variance, an unsupervised method, is perhaps the simplest criterion for selecting representative features: it favors the dimensions along which the data points have maximum variance, i.e., the features most useful for representing the data, whereas Fisher score [6] selects the features most important for discrimination. Laplacian Score (LS) is a recently proposed feature selection method that can be used in either supervised or unsupervised scenarios [9]. It seeks the features that best reflect the underlying manifold structure, and it has proved effective and efficient compared with data variance and Fisher score.

In this paper, we propose an improved Laplacian Score algorithm called Iterative Laplacian Score (ILS). A series of experiments on several high-dimensional UCI data sets and face data sets compares our algorithm with Laplacian Score and other feature selection algorithms, e.g., data variance. The experimental results show that our algorithm outperforms the original Laplacian Score algorithm on both classification and clustering tasks.

The rest of this paper is organized as follows. Section 2 briefly introduces the Laplacian Score algorithm as background. In Section 3 we present the proposed ILS algorithm in detail. The experimental results are reported in Section 4. Finally, Section 5 concludes this paper.

2 Laplacian Score

LS constructs a nearest neighbor graph to model the local structure of the data and then selects those features which best respect this graph structure. Suppose we have m data points and each point has n features. Let $L_r$ denote the Laplacian score of the r-th feature, r = 1, ..., n, and let $f_{ri}$ denote the i-th sample of the r-th feature, i = 1, ..., m. The LS algorithm is described as follows [4]:

Step 1: Construct a nearest neighbor graph G. The i-th node corresponds to the i-th data point $x_i$. We put an edge between nodes i and j if $x_i$ and $x_j$ are "close", that is, if $x_i$ is among the k nearest neighbors of $x_j$ or $x_j$ is among the k nearest neighbors of $x_i$.

Step 2: Choose the weights. If nodes i and j are connected, set $w_{ij} = e^{-\frac{\|x_i - x_j\|^2}{t}}$, where t is a suitable constant; otherwise set $w_{ij} = 0$.

Step 3: Compute the Laplacian score. Define $f_r = [f_{r1}, f_{r2}, \ldots, f_{rm}]^{T}$; the Laplacian score of the r-th feature, $L_r$, is computed as

$L_r = \dfrac{\tilde{f}_r^{T} L \tilde{f}_r}{\tilde{f}_r^{T} D \tilde{f}_r}$,

where $\tilde{f}_r = f_r - \dfrac{f_r^{T} D \mathbf{1}}{\mathbf{1}^{T} D \mathbf{1}} \mathbf{1}$, $\mathbf{1} = [1, \ldots, 1]^{T}$, $D$ is the diagonal matrix with $D_{ii} = \sum_j W_{ij}$, and $L = D - W$ is the graph Laplacian. The Laplacian Score algorithm chooses the features with the lowest scores.

3 Iterative Laplacian Score Algorithm

As in LS, the proposed method evaluates feature importance by locality preserving ability. Both algorithms are based on the observation that data points from the same class are likely to be close to each other. The key idea of ILS is that at each iteration we discard the least relevant features according to their current Laplacian scores. As a result, ILS improves the nearest neighbor graph at every iteration and selects feature subsets that respect the structure of the underlying manifold better than the original Laplacian Score. We develop two versions of ILS, described in detail as follows.

ILS-1
Input: data set X with n features, number of selected features s
Output: ranked list of s features
Repeat:
(1) Construct the nearest neighbor graph for the data X;
(2) Compute the Laplacian score of each feature currently in X using the LS algorithm;
(3) Rank the features by their Laplacian scores in ascending order;
(4) Discard the last d features and update X to contain only the remaining features;
Until: no more than s features are left.

ILS-2
Input: data set X with n features, number of selected features s
Output: ranked list of s features
Repeat:
(1) Construct the nearest neighbor graph for the data X;
(2) Compute the Laplacian score of all features of the initial X using the LS algorithm;
(3) Rank the features by their Laplacian scores in ascending order;
(4) Discard the last d features and update X to contain only the remaining features;
Until: no more than s features are left.

At each iteration, both ILS-1 and ILS-2 reconstruct the nearest neighbor graph for X using only the remaining features. The difference between ILS-1 and ILS-2 is that the former computes Laplacian scores only for the remaining features, whereas the latter computes Laplacian scores for all of the original features. This means that in ILS-2 features discarded earlier can be reselected in a later iteration, which cannot happen in ILS-1. Experimental results show that ILS-1 and ILS-2 give almost identical results, so in the following sections we report only the results of ILS-1 and refer to it simply as ILS.
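To make the two building blocks above concrete, the following Python sketch shows one possible implementation of the Laplacian Score of Section 2 and of the ILS-1 loop, assuming a dense data matrix X of shape (m samples, n features). The function names, the use of scikit-learn's kneighbors_graph, and the default values of t and d_drop are our own illustrative choices and are not taken from the paper; only k = 5 follows the paper's setting.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph


def laplacian_score(X, k=5, t=1.0):
    """Laplacian Score of each column (feature) of X; lower is better (sketch)."""
    m, n = X.shape
    # Step 1: symmetric k-nearest-neighbor graph (edge if either point is a neighbor of the other)
    dist = kneighbors_graph(X, n_neighbors=k, mode="distance", include_self=False)
    dist = dist.maximum(dist.T).toarray()
    # Step 2: heat-kernel weights on the existing edges
    W = np.where(dist > 0, np.exp(-(dist ** 2) / t), 0.0)
    d = W.sum(axis=1)                        # degrees, D_ii = sum_j W_ij
    L = np.diag(d) - W                       # graph Laplacian L = D - W
    # Step 3: score each feature
    scores = np.empty(n)
    for r in range(n):
        fr = X[:, r].astype(float)
        fr_tilde = fr - (fr @ d) / d.sum()   # subtract the D-weighted mean
        num = fr_tilde @ L @ fr_tilde        # f~^T L f~
        den = fr_tilde @ (d * fr_tilde)      # f~^T D f~
        scores[r] = num / den if den > 1e-12 else np.inf
    return scores


def iterative_laplacian_score(X, s, d_drop=1, k=5, t=1.0):
    """ILS-1 sketch: repeatedly drop the d_drop worst-scoring features and rebuild the graph."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > s:
        scores = laplacian_score(X[:, remaining], k=k, t=t)
        order = np.argsort(scores)                    # ascending: best (lowest score) first
        n_drop = min(d_drop, len(remaining) - s)
        remaining = [remaining[i] for i in order[:len(remaining) - n_drop]]
    final = laplacian_score(X[:, remaining], k=k, t=t)
    return [remaining[i] for i in np.argsort(final)]  # surviving features, best first
```

Rebuilding the neighbor graph inside the loop is the point of ILS: after each pruning step the distances, and hence the graph and the scores, are recomputed on the reduced feature set.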

4 Experiments

In this section, we investigate the performance of the proposed ILS algorithm. Both supervised (classification) and unsupervised (clustering) experiments are carried out to demonstrate its effectiveness. Experiments are conducted on several high-dimensional data sets, including UCI data and the ORL face database. In all experiments, we choose k = 5 to construct the nearest neighbor graph.

4.1 Results on UCI Data Sets

In the classification experiments, we compare ILS with data variance and LS. We test the performance of the different algorithms using 10 times 10-fold cross-validation, with the nearest neighbor (1-NN) classifier under Euclidean distance. The performance of each method is measured by the classification accuracy obtained with the selected features on the test data. In ILS, we discard one feature per iteration (i.e., d = 1). The statistics of the UCI data sets are given in Table 1. Fig. 1 shows the accuracy curves of the three algorithms vs. the number of selected features.

Table 1. Statistics of the UCI data sets

Data set        Samples   Features   Classes
Wine              178        13         3
Breast cancer     198        33         2
Ionosphere        351        34         2
Dermatology       358        34         6
Anneal            898        38         5
Horse             368        27         2
Parkinsons         57         6         2
Zoo               101        16         7

From Fig. 1 we can see that the performance of ILS is almost always better than that of LS and variance. In particular, ILS significantly outperforms LS when the number of selected features is much smaller than the original data dimension, e.g., from 5 to 10 features.

[Fig. 1. Accuracy vs. number of selected features on eight UCI data sets; panels: (a) Wine, (b) Breast cancer, (c) Ionosphere, (d) Dermatology, (e) Anneal, (f) Horse, (g) Parkinsons, (h) Zoo.]
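As a rough illustration of the evaluation protocol of Section 4.1, the sketch below estimates 1-NN accuracy for a given number of selected features using repeated 10-fold cross-validation. This is not the authors' code: the helper name evaluate_selection, the scikit-learn classes, and the choice to run feature selection inside each training fold are our assumptions; the paper only specifies 10 times 10-fold cross-validation with a Euclidean 1-NN classifier.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def evaluate_selection(X, y, s, select_fn, n_splits=10, n_repeats=10, seed=0):
    """Mean 1-NN accuracy over repeated 10-fold CV using the top-s features chosen by select_fn."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    accs = []
    for train_idx, test_idx in cv.split(X, y):
        feats = select_fn(X[train_idx], s)   # e.g. the iterative_laplacian_score sketch above
        clf = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
        clf.fit(X[train_idx][:, feats], y[train_idx])
        accs.append(clf.score(X[test_idx][:, feats], y[test_idx]))
    return float(np.mean(accs))


# Hypothetical usage: an accuracy curve for ILS with d = 1, as plotted in Fig. 1
# curve = [evaluate_selection(X, y, s, lambda A, s: iterative_laplacian_score(A, s, d_drop=1))
#          for s in range(5, 31, 5)]
```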

4.2 Face Clustering on the ORL Data Set

In this section, we apply ILS to face clustering on the ORL face database, with LS as the baseline for comparison. The ORL database contains images of 40 individuals, with 10 different images per individual. We crop the original ORL images to 32x32 pixels and use the raw pixel intensities as features, so each face image is represented by a 1024-dimensional feature vector. In our experiments, several tests are performed with different numbers of clusters (i.e., 5, 10, 15, and 40). LS and ILS are used to select features, and K-means is then performed in the selected feature subspace. K-means is repeated 10 times with different initializations and the best result is recorded; this process is repeated 20 times and the average performance is reported. We evaluate the clustering performance of the different algorithms using two measures, namely the accuracy and the normalized mutual information metric [7]. In ILS, we set the number of discarded features per iteration to 5 (i.e., d = 5). Fig. 2 shows the results with different numbers of selected features.

[Fig. 2. Clustering results (accuracy and normalized mutual information) vs. number of selected features for (a) 5 classes, (b) 10 classes, (c) 15 classes, and (d) 40 classes.]
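The clustering evaluation described above can be sketched roughly as follows. The accuracy computed via a best one-to-one matching between clusters and classes (Hungarian algorithm) and the simplification of keeping the best of the 10 K-means runs per metric are our assumptions; the paper only states that K-means is run 10 times per trial and that accuracy and normalized mutual information are reported.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster labels to classes."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / len(y_true)


def evaluate_clustering(X, y, feats, n_clusters, n_runs=10):
    """K-means on the selected features; keep the best of n_runs random initializations."""
    Xs = X[:, feats]
    best_acc, best_nmi = 0.0, 0.0
    for run in range(n_runs):
        pred = KMeans(n_clusters=n_clusters, n_init=1, random_state=run).fit_predict(Xs)
        best_acc = max(best_acc, clustering_accuracy(y, pred))
        best_nmi = max(best_nmi, normalized_mutual_info_score(y, pred))
    return best_acc, best_nmi
```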

As can be seen from Fig. 2, our proposed ILS algorithm is superior in performance to the original LS. Taking a closer look at Fig. 2, we find that when the number of selected features is smaller than 400, ILS is significantly better than LS. All these experimental results demonstrate that our proposed algorithm is more effective than LS.

4.3 Further Discussion

In order to investigate the influence of the number of discarded features per iteration (i.e., d), we conducted experiments on the ORL face data with different numbers of discarded features. In each experiment, we randomly select 2, 5, or 8 images per individual as training data and use the remaining images as test data. ILS and LS are then applied to the face data to select features, and the nearest neighbor classifier is used to measure the performance of the algorithms. Fig. 3 plots the accuracy curves vs. the number of selected features.

[Fig. 3. Accuracy vs. number of selected features for different numbers of discarded features per iteration (ILS with d = 2, 5, 10), under the three training splits.]

From Fig. 3 we can draw the conclusion that when the dimensionality of the original data is large, the number of features discarded at each iteration does not matter much to the final results.
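A rough sketch of this protocol follows, reusing the hypothetical iterative_laplacian_score function from the earlier sketch and reading the compared discard sizes as d = 2, 5, 10; the split helper, function names, and parameter defaults are all illustrative assumptions rather than the authors' setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def per_class_split(y, n_train, rng):
    """Randomly pick n_train samples of every class for training, the rest for testing."""
    train_idx = []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        train_idx.extend(idx[:n_train])
    train_idx = np.array(train_idx)
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx


def compare_discard_sizes(X, y, s, n_train, d_values=(2, 5, 10), seed=0):
    """1-NN accuracy on one random per-class split, for several ILS discard sizes d."""
    rng = np.random.default_rng(seed)
    tr, te = per_class_split(y, n_train, rng)
    accs = {}
    for d in d_values:
        feats = iterative_laplacian_score(X[tr], s, d_drop=d)  # hypothetical helper from above
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[tr][:, feats], y[tr])
        accs[d] = clf.score(X[te][:, feats], y[te])
    return accs
```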

5 Conclusion

In this paper, we proposed a new Iterative Laplacian Score (ILS) method for feature selection, based on evaluating feature importance according to locality preserving ability with respect to the underlying manifold structure. In each iteration of ILS, we discard the least significant features according to their current scores and reconstruct the nearest neighbor graph with the remaining features. The experimental results on both UCI data sets and face databases demonstrate the effectiveness of the proposed algorithm. In particular, when very few features are selected, ILS achieves significantly better performance than the conventional Laplacian Score (LS).

References

1. Belkin, M., Niyogi, P.: Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In: Advances in Neural Information Processing Systems, vol. 14 (2001)
2. Dy, J.G., Brodley, C.E., Kak, A.C., Broderick, L.S., Aisen, A.M.: Unsupervised feature selection applied to content-based retrieval of lung images. IEEE Trans. Pattern Anal. Mach. Intell. 25, 373-378 (2003)
3. Dy, J.G., Brodley, C.E.: Feature selection for unsupervised learning. J. Mach. Learn. Res. 5, 845-889 (2004)
4. He, X., Cai, D., Niyogi, P.: Laplacian Score for Feature Selection. In: Advances in Neural Information Processing Systems, vol. 17. MIT Press, Cambridge (2005)
5. He, X., Niyogi, P.: Locality Preserving Projections. In: Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2004)
6. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
7. Xu, W., Liu, X., Gong, Y.: Document Clustering Based on Non-negative Matrix Factorization. In: Proc. ACM SIGIR Conference on Information Retrieval (2003)
8. Sun, Y., Todorovic, S., Goodison, S.: Local-Learning-Based Feature Selection for High-Dimensional Data Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9) (2010)
9. Xu, J., Man, H.: Dictionary Learning Based on Laplacian Score in Sparse Coding. In: Perner, P. (ed.) MLDM 2011. LNCS (LNAI), vol. 6871, pp. 253-264. Springer, Heidelberg (2011)