Local Linear ICA for Mutual Information Estimation in Feature Selection


Tian Lan, Deniz Erdogmus
Department of Biomedical Engineering, OGI, Oregon Health & Science University, Portland, Oregon, USA
E-mail: {lantian,deniz}@bme.ogi.edu

Abstract

Mutual information is an important tool in many applications. Specifically, in classification systems, feature selection based on MI estimation between features and class labels helps to identify the features most directly related to the classification performance. MI estimation is extremely difficult and imprecise in high-dimensional feature spaces with arbitrary distributions. We propose a framework using ICA and sample-spacing based entropy estimators to estimate MI. In this framework, the high-dimensional MI estimation is reduced to multiple independent one-dimensional MI estimation problems. This approach is computationally efficient; however, its precision relies heavily on the result of the ICA. In our previous work, we assumed that the feature space has linear structure, hence linear ICA was adopted. Nevertheless, this is a weak assumption, which might not hold in many applications. Although nonlinear ICA can in theory solve any ICA problem, its complexity and its demand for data samples restrict its application. A better trade-off between linear and nonlinear ICA is local linear ICA, which uses piecewise linear ICA to approximate the nonlinear relationships among the data. In this paper, we propose that by replacing linear ICA with local linear ICA, we can obtain precise MI estimates without greatly increasing the computational complexity. The experimental results substantiate this claim.

Keywords: Local Linear ICA, Mutual Information, Entropy Estimation, Feature Selection

I. INTRODUCTION

Mutual information (MI) is an important tool in many applications, such as communications, signal processing, and machine learning. Specifically, in pattern recognition, dimensionality reduction and feature selection based on maximizing the mutual information between features and class labels have attracted increasing attention, because this approach finds the most relevant features and therefore (i) reduces the computational load in real-time systems; (ii) eliminates irrelevant or noisy features, increasing the robustness of the system; and (iii) is a filter approach, independent of the classifier design and hence more flexible.

The MI-based method for feature selection is motivated by lower and upper bounds in information theory [2,3]. The average probability of error has been shown to be related to the MI between the feature vectors and the class labels. Specifically, Fano's and Hellman & Raviv's bounds demonstrate that the probability of error is bounded from below and above by quantities that depend on the Shannon MI between these variables. Maximizing this MI reduces both bounds and therefore forces the probability of error to decrease. One difficulty of this MI-based feature selection method is that estimating MI requires knowledge of the joint pdf of the data in the feature space. Moreover, since features are generally mutually dependent, selecting features by their individual MI is typically suboptimal in the sense of the maximum joint mutual information principle. Several MI-based methods have been developed for feature selection in the past years [4-9]. Unfortunately, none of these methods handles the high-dimensional situation well. In practice, the mutual information must be estimated nonparametrically from the training samples.
Although this is a challenging problem for multiple continuous-valued random variables, the class labels are discrete-valued in the classification context. This reduces the problem to estimating entropies of continuous random vectors. Furthermore, if the components of the random vector are independent, the joint entropy becomes the sum of the marginal entropies. Thus, the joint mutual information of a feature vector with the class labels is equal to the sum of the marginal mutual information of each individual feature with the class labels, provided that the features are independent. In our previous work, we exploited this fact and proposed a framework using an ICA transformation and sample-spacing estimators to estimate the mutual information between features and class labels [1]. This framework is attractive because it is modular: each component, including the ICA transformation and the entropy estimator, can be replaced by any qualified algorithm.
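To make this decomposition concrete, here is a minimal numpy sketch, ours rather than the authors' implementation, of estimating the MI between one feature and the discrete class label as $I(y;c) = H(y) - \sum_c p_c H(y \mid c)$, using the m-spacing entropy estimator described in Section II; the function names are our own.

```python
import numpy as np

def spacing_entropy(y, m=None):
    """m-spacing estimate of the differential entropy of a 1-D sample."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    if m is None:
        m = max(1, int(np.sqrt(n)))  # typical bias-variance choice m = sqrt(N)
    gaps = np.maximum(y[m:] - y[:-m], 1e-12)  # y_(i+m) - y_(i), guard ties
    return np.mean(np.log((n + 1) * gaps / m))

def mi_feature_labels(y, c):
    """I(y;c) = H(y) - sum_c p_c H(y|c) for one feature, discrete labels."""
    labels, counts = np.unique(c, return_counts=True)
    priors = counts / len(c)
    h_cond = sum(p * spacing_entropy(y[c == lab])
                 for p, lab in zip(priors, labels))
    return spacing_entropy(y) - h_cond

# Example: a feature whose mean depends on the class carries positive MI.
rng = np.random.default_rng(0)
c = rng.integers(0, 2, size=2000)
y = rng.normal(loc=2.0 * c, scale=1.0)
print(mi_feature_labels(y, c))  # noticeably greater than 0
```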

[Figure 1. Local linear ICA for mutual information estimation: the data (x_1, ..., x_d) are clustered into n partitions; within each cluster i a linear ICA transformation W^(i) is applied, the MI between the resulting components and the class label c is estimated, and the per-cluster estimates are summed.]

We employed the cumulant-based generalized eigenvalue decomposition (GED) approach [10] to determine the linear ICA transformation and the sample-spacing estimator [11] as the marginal entropy estimator in EEG signal classification [12]. Although linear ICA yields good performance in some applications, it is a weak assumption in many real-world problems. Since this framework relies heavily on the performance of the ICA, the precision of the MI estimate is greatly impaired in nonlinear situations. Recently, nonlinear ICA has attracted more attention due to its ability to capture nonlinear relationships within the data [13]. However, the complexity of finding robust regularized nonlinear transformations makes it difficult to use in many situations. Furthermore, nonlinear ICA has the following shortcomings: (i) it requires more training data, especially in higher-dimensional situations; (ii) our framework would need to be revised according to the form of the nonlinear ICA. In this case, local linear ICA (which is used in this paper) presents a good trade-off [14]. Local ICA uses piecewise linear ICA transformations to approximate nonlinear ICA. In practice, a clustering algorithm is applied first to divide the data into segments. We assume the data within each segment have a linear structure, so we can use the previously proposed framework to estimate the MI within each segment. According to the principle of information addition, the total MI equals the summation of the MIs of the segments. In this way, we can easily extend our previous framework to local-ICA-based nonlinear MI estimation. The system block diagram of local-ICA-based MI estimation is shown in Fig. 1. In the following sections, we introduce the theoretical feasibility of local ICA for MI estimation. To verify the proposed approach, we also present experimental results on synthetic and EEG data sets.

II. THEORETICAL BACKGROUND

Consider a group of nonlinearly distributed, d-dimensional feature vectors $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$. We first apply a suitable clustering algorithm to segment the data into n partitions: $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(n)}$. We assume that within each partition $\mathbf{x}^{(i)}$ the data is d-dimensional and distributed in accordance with the linear ICA model. Within each partition, we apply the linear ICA transformation to get feature vectors $\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(n)}$, where $\mathbf{y}^{(i)} = [y^{(i)}_1, y^{(i)}_2, \ldots, y^{(i)}_d]^T$. Since a linear ICA transformation does not change the mutual information, we have:

$$I_S(\mathbf{x}^{(i)}; c) = I_S(\mathbf{y}^{(i)}; c) \qquad (1)$$

where $i = 1, \ldots, n$ indexes the n clusters, $\mathbf{x}^{(i)}$ are the original dependent components, and $\mathbf{y}^{(i)}$ are the independent components. It is well known that the mutual information can be expressed in terms of conditional entropy as follows:

$$I_S(\mathbf{y}^{(i)}; c) = H_S(\mathbf{y}^{(i)}) - \sum_c p_c\, H_S(\mathbf{y}^{(i)} \mid c) \qquad (2)$$

where c is the class label and $p_c$ are the prior class probabilities.
If the components of the random vector $\mathbf{y}^{(i)}$ are independent, the joint and joint-conditional entropies become the sums of the marginal and marginal-conditional entropies:¹

$$I_S(y^{(i)}_1, y^{(i)}_2, \ldots, y^{(i)}_d;\, c) \approx \sum_{k=1}^{d} H_S(y^{(i)}_k) - \sum_c p_c \sum_{k=1}^{d} H_S(y^{(i)}_k \mid c) \qquad (3)$$

It is easy to show that the total MI between $\mathbf{x}$ and c equals the summation of the MI within each segmented data group:

$$I_S(\mathbf{x}; c) = \sum_{i=1}^{n} I_S(\mathbf{x}^{(i)}; c) \qquad (4)$$

Combining (1)-(4), we get:

$$I_S(x_1, x_2, \ldots, x_d;\, c) \approx \sum_{i=1}^{n} \left[ \sum_{k=1}^{d} H_S(y^{(i)}_k) - \sum_c p_c \sum_{k=1}^{d} H_S(y^{(i)}_k \mid c) \right] \qquad (5)$$

In principle, the proposed local ICA for MI estimation contains three parts: a clustering algorithm, a linear ICA algorithm, and a one-dimensional entropy estimator.

¹ Here we assume that the same ICA transformation achieves independence in the overall and conditional distributions simultaneously.

For each component, one can use any established method. In this paper, we use K-means clustering, GED-based ICA transformations, and sample-spacing entropy estimators.

K-means clustering: The K-means method is defined as minimizing the overall distance of the data points to the centers of the K clusters:

$$J = \sum_{i=1}^{K} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \mathbf{m}_i \rVert^2 \qquad (6)$$

where $\mathbf{m}_i$ is the center of cluster $S_i$. The algorithm first selects K arbitrary cluster centers and then computes the distance from every data point to each center. Assigning each data point to its nearest center yields a segmentation of the data set. Because the initial centers are chosen arbitrarily, this segmentation may not minimize (6); replacing each cluster center with the mean of its group gives a new set of centers. This process is repeated until J converges to a minimum.

ICA using generalized eigenvalue decomposition: The square linear ICA problem is expressed in (7), where X is the n × N observation matrix, A is the n × n mixing matrix, and S is the n × N independent source matrix:

$$\mathbf{X} = \mathbf{A}\mathbf{S} \qquad (7)$$

Each column of X and S represents one data sample. Many effective and efficient algorithms, based on a variety of assumptions including maximization of non-Gaussianity and minimization of mutual information, exist to solve the ICA problem [11,15,16]. Those utilizing fourth-order cumulants can be compactly formulated as a generalized eigendecomposition problem that gives the ICA solution in analytical form [10]. According to this formulation, one possible assumption set that leads to an ICA solution utilizes higher-order statistics (specifically, fourth-order cumulants). Under this set of assumptions, the separation matrix W is the solution to the following generalized eigendecomposition problem:

$$\mathbf{R}_x \mathbf{W} = \mathbf{Q}_x \mathbf{W} \boldsymbol{\Lambda} \qquad (8)$$

where $\mathbf{R}_x$ is the covariance matrix and $\mathbf{Q}_x$ is the cumulant matrix estimated using sample averages: $\mathbf{Q}_x = E[\mathbf{x}^T\mathbf{x}\,\mathbf{x}\mathbf{x}^T] - \mathbf{R}_x\,\mathrm{tr}(\mathbf{R}_x) - E[\mathbf{x}\mathbf{x}^T]E[\mathbf{x}\mathbf{x}^T] - \mathbf{R}_x\mathbf{R}_x$. Given the estimates of these matrices, the ICA solution can be easily determined using efficient generalized eigendecomposition algorithms (or the eig command in Matlab).

Estimating MI using sample-spacings: Many entropy estimators exist in the literature for single-dimensional variables. Here, we use an estimator based on sample-spacings [11], which stems from order statistics. This estimator is selected because of its consistency, rapid asymptotic convergence, and simplicity. Consider a one-dimensional random variable Y. Given a set of iid samples of Y, $\{y_1, \ldots, y_N\}$, these samples are first sorted in increasing order such that $y_{(1)} \le \cdots \le y_{(N)}$. The m-spacing entropy estimator is given by:

$$\hat{H}(Y) = \frac{1}{N-m} \sum_{i=1}^{N-m} \log\!\left( \frac{(N+1)\,\bigl(y_{(i+m)} - y_{(i)}\bigr)}{m} \right) \qquad (9)$$

The selection of the parameter m is determined by a bias-variance trade-off; typically $m = \sqrt{N}$. In general, for asymptotic consistency the sequence m(N) should satisfy

$$\lim_{N \to \infty} m(N) = \infty, \qquad \lim_{N \to \infty} \frac{m(N)}{N} = 0 \qquad (10)$$
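These three components assemble into the complete estimator of (5). The following is a self-contained numpy/scipy sketch of that assembly, assuming scikit-learn's KMeans for the clustering step; it is our reconstruction from the paper's equations, not the authors' code, and all names are our own.

```python
import numpy as np
from scipy.linalg import eig
from sklearn.cluster import KMeans

def spacing_entropy(y, m=None):
    """m-spacing entropy estimate of a 1-D sample, eq. (9)."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    m = m or max(1, int(np.sqrt(n)))          # typical choice m = sqrt(N)
    gaps = np.maximum(y[m:] - y[:-m], 1e-12)  # y_(i+m) - y_(i), guard ties
    return np.mean(np.log((n + 1) * gaps / m))

def ged_ica(X):
    """Cumulant-based linear ICA via the GED problem of eq. (8).
    X: (d, N), one sample per column. Returns the demixing matrix W."""
    X = X - X.mean(axis=1, keepdims=True)
    d, N = X.shape
    R = X @ X.T / N                            # covariance R_x
    sq = np.sum(X * X, axis=0)                 # x^T x for each sample
    # Q_x = E[x^T x xx^T] - R_x tr(R_x) - E[xx^T]E[xx^T] - R_x R_x;
    # after centering, the last two terms both equal R @ R.
    Q = (X * sq) @ X.T / N - R * np.trace(R) - 2 * (R @ R)
    # R_x W = Q_x W Lambda: generalized eigenvectors of the pencil (R, Q)
    _, W = eig(R, Q)
    return np.real(W)  # real up to numerical noise for this symmetric pencil

def local_ica_mi(X, c, K=4, seed=0):
    """Local linear ICA MI estimate of eq. (5). X: (N, d); c: (N,) labels."""
    parts = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    classes = np.unique(c)
    total = 0.0
    for i in range(K):
        Xi, ci = X[parts == i], c[parts == i]
        Y = Xi @ ged_ica(Xi.T)                 # per-cluster independent comps
        for k in range(Y.shape[1]):
            h_marg = spacing_entropy(Y[:, k])
            h_cond = sum((ci == cl).mean() * spacing_entropy(Y[ci == cl, k])
                         for cl in classes if (ci == cl).sum() > 2)
            total += h_marg - h_cond           # per-dimension MI, summed
    return total
```

Feature ranking then amounts to evaluating this estimate on candidate feature subsets and ordering by the result.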

III. EXPERIMENTS AND RESULTS

[Figure 2. Synthetic dataset. Top: distribution of x_1 and x_2; bottom: distribution of x_3 and x_4. The '+' marks denote cluster centers.]

Synthetic dataset: In order to illustrate the feasibility and the performance of the proposed local ICA for MI estimation, we apply it to a synthetic dataset. This dataset consists of four-dimensional feature vectors x_i (i = 1, ..., 4), where x_1 and x_2 are nonlinearly related (Fig. 2, top), and x_3 and x_4 are Gaussian distributed with different means and variances (Fig. 2, bottom). There are two classes in this dataset (shown as blue and red in Fig. 2). The two classes are separable in the (x_1, x_2) plane but overlap in the (x_3, x_4) plane. Clearly, this dataset can be well classified using only x_1 and x_2, while x_3 and x_4 provide redundant and insufficient information for classification. From Fig. 2 we can see that x_2 has less overlap than x_1, while x_3 has less overlap than x_4. So, ideally, the feature ranking in descending order of importance in terms of classification rate should be x_2, x_1, x_3, x_4. In our experiments, we choose a sample size of 10000 and segment the data into 50 partitions. In the top panel of Fig. 2, the cluster centers are distributed evenly along the curve of each class, segmenting the nonlinear data into piecewise linear groups. The choice of the number of centers K is critical: with a large K, the linear relationship within each segment becomes stronger; however, a large K reduces the sample size within some partitions, which leads to inaccurate ICA transformation and entropy estimates. Applying the proposed method to this synthetic dataset, we obtain the feature ranking x_2, x_1, x_3, x_4, which is consistent with our expectation. (One possible construction of such a dataset is sketched below.)
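The paper does not give the exact generator for this dataset, so the following is only an illustrative stand-in with the same qualitative structure: two classes on separated nonlinear curves in the (x_1, x_2) plane, plus two heavily overlapping Gaussian features x_3 and x_4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # samples per class (10000 total, as in the experiment)

# x1, x2: each class lies on its own quadratic curve; the curves are
# offset vertically, so the classes are separable in this plane.
t0, t1 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
x1 = np.concatenate([t0, t1])
x2 = np.concatenate([t0**2, t1**2 + 3.0]) + rng.normal(0, 0.2, 2 * n)

# x3, x4: Gaussians with slightly different means/variances, overlapping,
# so they carry little class information.
x3 = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(0.5, 1.2, n)])
x4 = np.concatenate([rng.normal(0.0, 2.0, n), rng.normal(0.2, 2.0, n)])

X = np.column_stack([x1, x2, x3, x4])
c = np.repeat([0, 1], n)
# Features can now be ranked by estimating their MI with the class label,
# e.g. via local_ica_mi from the sketch in Section II.
```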
EEG data classification: To compare the performance of local ICA for MI estimation with the previous linear ICA approximation, we apply both approaches to an EEG-based brain-computer interface (BCI) dataset. The EEG data were collected as part of an augmented cognition project, in which the estimated cognitive state is used to assess the mental load of the subject, in order to modify the subject's interaction with a computer system with the goal of increasing user performance. During data collection, two subjects were required to execute different levels of mental tasks, classified as high workload and low workload. EEG data were collected using a BioSemi ActiveTwo system with a 31-channel EEG cap (AF3, AF4, C3, C4, CP1, CP2, CP5, CP6, Cz, F3, F4, F7, F8, FC1, FC2, FC5, FC6, Fp1, Fp2, Fz, O1, O2, Oz, P3, P4, P7, P8, PO3, PO4, Pz) and eye electrodes. Vertical and horizontal eye movements and blinks were recorded with electrodes below and lateral to the left eye. EEG was sampled and recorded at 256 Hz from the 31 channels. The EEG signals were preprocessed to remove eye blinks using an adaptive linear filter based on the Widrow-Hoff training rule (LMS) [17]. Information from the VEOGLB ocular reference channel was used as the noise reference source for the adaptive ocular filter. DC drifts were removed using high-pass filters (0.5 Hz cutoff). A band-pass filter (2-50 Hz) was also employed, as this interval is generally associated with cognitive activity.

The power spectral density (PSD) of the EEG signals, estimated using the Welch method [18] with 50%-overlapping 1-second windows, is integrated over five frequency bands: 4-8 Hz (theta), 8-12 Hz (alpha), 12-16 Hz (low beta), 16-30 Hz (high beta), and 30-44 Hz (gamma). These band powers, sampled every 0.25 seconds, are used as the basic features for cognitive classification. The novelty in this application is that the subjects move around freely, in contrast to typical brain-computer interface (BCI) experimental setups where the subjects are in a strictly controlled setting. The assessment of cognitive state in ambulatory subjects is particularly difficult, since the movements introduce strong artifacts irrelevant to the mental task/load. Furthermore, the features extracted from the EEG data exhibit strong nonlinearity. In this case, feature selection by local ICA becomes important, due to its ability to precisely retain the information useful for classification and eliminate the irrelevant information. To test the performance of local ICA for MI estimation in feature selection, we feed the EEG data into a classification system consisting of four parts: preprocessing, feature extraction and selection, classification, and postprocessing. Preprocessing filters out noise and removes the artifacts, as mentioned above. Feature extraction and selection generates features from the clean EEG signal and selects useful EEG channels (each channel contributing five frequency bands) using the proposed method; the band-power computation is sketched below.
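A sketch of that computation follows; the 256 Hz rate and the 1-second, 50%-overlap Welch windows are from the text, while everything else (names, window bookkeeping) is our own assumption.

```python
import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

FS = 256  # sampling rate (Hz), as stated above
BANDS = {"theta": (4, 8), "alpha": (8, 12), "low_beta": (12, 16),
         "high_beta": (16, 30), "gamma": (30, 44)}

def band_power_features(eeg, fs=FS):
    """eeg: (n_channels, n_samples) array for one analysis window.
    Returns one integrated-PSD feature per channel per band."""
    # Welch PSD with 1-second segments and 50% overlap
    f, pxx = welch(eeg, fs=fs, nperseg=fs, noverlap=fs // 2, axis=-1)
    feats = []
    for lo, hi in BANDS.values():
        mask = (f >= lo) & (f < hi)
        feats.append(trapezoid(pxx[:, mask], f[mask], axis=-1))
    return np.concatenate(feats)  # 31 channels x 5 bands -> 155 features

# Example: 31 channels, 4 seconds of signal
x = np.random.randn(31, 4 * FS)
print(band_power_features(x).shape)  # (155,)
```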

With around 2500 data samples per subject and a feature dimension of 155 (31 EEG channels, 5 frequency bands each), we use K = 4, i.e., we segment the data into 4 groups. This is a rough approximation; however, we cannot choose a larger K because with K = 4 we already have only around 600 data samples per group in 155 dimensions. For classification, the K-nearest-neighbor (KNN) classifier is utilized. The postprocessing stage uses the assumption that the cognitive state during a given continuous task varies slowly in time: a median filter operating on a 2-second window of the decisions most recently generated by the classifier eliminates a portion of the erroneous decisions made by the classification system (see the sketch after the table below).

[Figure 3. EEG channel ranking in terms of classification rate for two subjects by linear ICA and local linear ICA: classification rate vs. number of best EEG channels. Solid lines with stars: local ICA; dashed lines with circles: linear ICA; red: subject 1; blue: subject 2.]

The EEG channel selection results, evaluated by classification rate, are shown in Fig. 3. The red lines show the classification performance for different numbers of optimal EEG channels for subject 1; the blue lines are for subject 2. As a comparison, we also show the performance using linear ICA for EEG channel ranking on both subjects. The solid lines with stars show the classification results for local ICA, while the dashed lines with circles show the results for linear ICA. From Fig. 3 we can see that local ICA is superior to linear ICA at finding an optimal channel subset: for subject 1, local ICA gives better classification performance from the 7 best EEG channels; for subject 2, already from the single best EEG channel. To compare the feature ranking/selection results of linear ICA and local ICA more clearly, the table below lists the EEG channel rankings, in descending order of contribution to classification rate, for both subjects.

Table. EEG channel ranking (descending order) in terms of contribution to classification rate for subject 1 and subject 2 with the linear ICA and local ICA methods.

Sub 1, Linear ICA: FC2, AF3, CPZ, FP1, CP5, CP1, C4, CP6, P3, CP2, F4, F3, PO4, O2, P4, O1, PZ, P8, FCZ, FC1, FC6, AF4, FC5, FZ, P7, F8, CZ, FP2, F7, PO3, OZ

Sub 1, Local ICA: FC2, AF3, CPZ, AF4, FC5, F7, CZ, O2, F3, F4, FC6, C4, F8, P3, FP2, CP6, P8, PZ, P7, FZ, FC1, OZ, PO3, FCZ, FP1, CP2, CP1, P4, CP5, PO4, O1

Sub 2, Linear ICA: FC1, CP1, CZ, O1, C4, F3, FCZ, FC2, FZ, CP2, AF3, FP1, CP6, F4, P3, CPZ, CP5, AF4, FC6, P7, PO4, OZ, PZ, PO3, P4, F8, FC5, O2, F7, FP2, P8

Sub 2, Local ICA: CP1, O1, FP1, CZ, FC1, P8, PO4, FP2, FCZ, P7, F4, P3, P4, PO3, CP6, FC6, CPZ, FC5, AF4, FZ, F3, CP5, F7, F8, AF3, CP2, C4, PZ, FC2, O2, OZ
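The postprocessing step referenced above admits a compact implementation; this is a hedged sketch (the 4 Hz decision rate follows from the 0.25-second feature step stated earlier, but the exact window handling of the original system is not specified).

```python
import numpy as np
from scipy.signal import medfilt

def postprocess_decisions(decisions, fs_dec=4, window_sec=2):
    """Median-filter a stream of binary workload decisions.
    Decisions arrive every 0.25 s (fs_dec = 4 Hz), so a 2-second
    window spans 8 decisions; medfilt requires an odd kernel size."""
    k = int(fs_dec * window_sec) | 1  # force odd (here 9)
    return medfilt(np.asarray(decisions, dtype=float), kernel_size=k).astype(int)

# Example: isolated classifier errors inside a steady low-workload run
raw = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(postprocess_decisions(raw))  # isolated 1s removed; sustained run kept
```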
IV. CONCLUSIONS

In this paper, we presented a local ICA approach for MI estimation in feature selection. This work extends our previous proposal of using linear ICA transformations plus sample-spacing entropy estimators for MI estimation. The local ICA approach combines a clustering algorithm of choice (K-means in this paper), the cumulant-based analytical solution for linear ICA transformations, and a computationally efficient mutual information estimator, exploiting the fact that minimization of the Bayes classification error can be approximately achieved by maximizing the mutual information between the features and the class labels. Local ICA is used to approximate nonlinear ICA piecewise linearly for data whose distribution is nonlinearly generated. This yields increased performance and accuracy compared with linear ICA transformations.

In theory, nonlinear ICA can perfectly transform dependent components into independent feature vectors; however, its complexity and its requirement for large amounts of data samples limit its applications. Local ICA, on the other hand, is relatively simple compared with nonlinear ICA and requires fewer data samples, which is preferable in most applications. We must mention that, in order to estimate MI precisely, both nonlinear and local ICA need a large data set, but local ICA leaves much more room for a trade-off between accuracy and complexity. Experiments using synthetic and real (EEG) data demonstrate the utility, feasibility, and effectiveness of the proposed technique. In comparison with the previous linear ICA approach, it is observed (as expected) that local linear ICA yields better performance.

ACKNOWLEDGEMENTS

This work was supported by DARPA under contract DAAD-16-03-C-0054. The EEG data was collected at the Human-Centered Systems Lab., Honeywell, Minneapolis, Minnesota.

REFERENCES

[1] T. Lan, D. Erdogmus, A. Adami, M. Pavel, Feature Selection by Independent Component Analysis and Mutual Information Maximization in EEG Signal Classification, accepted to IJCNN 2005, Montreal, Canada, 2005.

[2] R.M. Fano, Transmission of Information: A Statistical Theory of Communications, Wiley, New York, 1961.

[3] M.E. Hellman, J. Raviv, Probability of Error, Equivocation and the Chernoff Bound, IEEE Transactions on Information Theory, vol. 16, pp. 368-372, 1970.

[4] K. Torkkola, Feature Extraction by Non-Parametric Mutual Information Maximization, Journal of Machine Learning Research, vol. 3, pp. 1415-1438, 2003.

[5] R. Battiti, Using Mutual Information for Selecting Features in Supervised Neural Networks Learning, IEEE Trans. Neural Networks, vol. 5, no. 4, pp. 537-550, 1994.

[6] A. Al-Ani, M. Deriche, An Optimal Feature Selection Technique Using the Concept of Mutual Information, Proceedings of ISSPA, pp. 477-480, 2001.

[7] N. Kwak, C.-H. Choi, Input Feature Selection for Classification Problems, IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143-159, 2002.

[8] H.H. Yang, J. Moody, Feature Selection Based on Joint Mutual Information, in Advances in Intelligent Data Analysis and Computational Intelligent Methods and Applications, 1999.

[9] H.H. Yang, J. Moody, Data Visualization and Feature Selection: New Algorithms for Nongaussian Data, Advances in NIPS, pp. 687-693, 2000.

[10] L. Parra, P. Sajda, Blind Source Separation via Generalized Eigenvalue Decomposition, Journal of Machine Learning Research, vol. 4, pp. 1261-1269, 2003.

[11] E.G. Learned-Miller, J.W. Fisher III, ICA Using Spacings Estimates of Entropy, Journal of Machine Learning Research, vol. 4, pp. 1271-1295, 2003.

[12] D. Erdogmus, A. Adami, M. Pavel, T. Lan, S. Mathan, S. Whitlow, M. Dorneich, Cognitive State Estimation Based on EEG for Augmented Cognition, Proceedings of the 2nd International IEEE EMBS Conference on Neural Engineering, Arlington, Virginia, Mar. 16-19, 2005.

[13] C. Jutten, J. Karhunen, Advances in Nonlinear Blind Source Separation, Proc. of the 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA2003), Nara, Japan, April 1-4, 2003, pp. 245-256.

[14] J. Karhunen, S. Malaroiu, Locally Linear Independent Component Analysis, Proc. of the Int. Joint Conf. on Neural Networks (IJCNN'99), Washington, DC, USA, July 1999, pp. 882-887.

[15] K.E. Hild II, D. Erdogmus, J.C. Principe, Blind Source Separation Using Renyi's Mutual Information, IEEE Signal Processing Letters, vol. 8, no. 6, pp. 174-176, 2001.

[16] A. Hyvärinen, E. Oja, A Fast Fixed-Point Algorithm for Independent Component Analysis, Neural Computation, vol. 9, no. 7, pp. 1483-1492, 1997.

[17] B. Widrow, M.E. Hoff, Adaptive Switching Circuits, in IRE WESCON Convention Record, 1960, pp. 96-104.

[18] P. Welch, The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging Over Short Modified Periodograms, IEEE Transactions on Audio and Electroacoustics, vol. 15, no. 2, pp. 70-73, 1967.