A Geometric Theory of Feature Selection and Distance-Based Measures
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

A Geometric Theory of Feature Selection and Distance-Based Measures

Kilho Shin and Adrian Pino Angulo
University of Hyogo, Kobe, Japan
yshin@ai.u-hyogo.ac.jp

Abstract

Feature selection measures are often explained by the analogy to a rule that measures the distance of a set of features to the closest ideal set of features. An ideal feature set is one that determines classes uniquely and correctly. Before this paper, this way of explanation was just an analogy. In this paper, we show a way to map arbitrary feature sets of datasets into a common metric space, which is indexed by a real number p with $1 \le p \le \infty$. Since this determines the distance between an arbitrary pair of feature sets, even if they belong to different datasets, the distance of a feature set to the closest ideal feature set can be used as a feature selection measure. Surprisingly, when p = 1, the measure is identical to the Bayesian risk, which is probably the feature selection measure used the most widely in the literature. For $1 < p \le \infty$, the measure is novel and has significantly different properties from the Bayesian risk. We also investigate, through experiments, the correlation between measurements by these measures and classification accuracy. As a result, we show that our novel measures with p > 1 exhibit stronger correlation than the Bayesian risk.

1 Introduction

Feature selection is one of the central focuses of machine learning research. In this paper, we understand the feature selection problem as follows: given a dataset, select an appropriately small number of features that faithfully determine the classes of the examples of the dataset. If we could find a feature subset whose features correctly determine the classes of the examples, it could be the answer we want [Almuallim and Dietterich, 1994].
Such feature subsets are referred to as reducts in rough set theory [Pawlak, 1991] and as consistent feature sets in this paper [Liu et al., 1998]. A dataset, however, does not necessarily include a consistent feature subset, and in such cases, we have to be satisfied with selecting feature subsets that are sufficiently close to being consistent. In other words, we need a measure that quantifies the distance of a feature subset to the closest consistent feature set. However, we only know two necessary conditions for such a measure, namely, determinacy and monotonicity: determinacy corresponds to the identity of indiscernibles among the axioms of metrics and requires that the measurement is zero if, and only if, the feature set is consistent; monotonicity, on the other hand, requires that the measurement of an arbitrary feature set is no greater than the measurement of any of its subsets. When a measure has the determinacy and monotonicity properties, we also say that it is determinant and monotonous. With a determinant and monotonous measure, a feature selection algorithm attempts to find feature sets whose measurements are close to zero. Generally, the feature selection measures proposed in the literature fall into two groups. One consists of measures that simply evaluate dependence between two features [Mengle and Goharian, 2009; Bolón-Canedo et al., 2011; Foithong et al., 2012; Suzuki and Sugiyama, 2013; Wang, 2015]. The mRMR framework [Peng et al., 2005] takes advantage of measures of this group and aims to select a feature set that maximizes the difference between the relevance and the redundancy of the feature set: the relevance is the collective dependence of the feature set on the class labels, while the redundancy is the internal mutual dependence among the features of the feature set.
On the other hand, the other group consists of determinant and monotonous measures [Liu et al., 1998; Shin and Xu, 2009; Arauzo-Azofra et al., 2008], and 4 of the 7 feature selection measures studied in [Molina et al., 2002] are known to belong to this group [Shin et al., 2011]. In this paper, we are interested in the latter group.

Example 1. The Bayesian risk, also known as the inconsistency rate [Liu et al., 1998], is defined as follows:

$\mu_{\mathrm{br}}^{D}(S) = \sum_{x \in \Omega_S} \Bigl( p(S = x) - \max_{c \in \Omega_C} p(S = x, C = c) \Bigr).$

Here, S is a feature subset of a dataset D, and C is the random variable that represents classes. $\Omega_S$ and $\Omega_C$ are the sample spaces of S and C, and p is the empirical probability distribution of D. This measure is, indeed, determinant and monotonous.

Example 2. Zhao et al. [Zhao and Liu, 2007] proposed Interact, a feature selection algorithm that leverages the Bayes
risk $\mu_{\mathrm{br}}$ to evaluate the closeness of feature sets to the state of being consistent. Therefore, Interact selects feature sets with small $\mu_{\mathrm{br}}$ measurements.

Example 3. Shin et al. [Shin and Xu, 2009] also proposed the conditional entropy, defined by

$\mu_{\mathrm{ce}}^{D}(S) = - \sum_{x \in \Omega_S} \sum_{c \in \Omega_C} p(S = x, C = c) \log \frac{p(S = x, C = c)}{p(S = x)},$

as a determinant and monotonous measure. On the other hand, the well-known feature selection algorithm that selects the top n features X with respect to the mutual information I(X; C) actually selects those feature sets that minimize

$\mu^{D}(\{X_1, \ldots, X_n\}) = \sum_{i=1}^{n} \mu_{\mathrm{ce}}^{D}(\{X_i\}).$

By using $\mu^{D}$ instead of $\mu_{\mathrm{ce}}^{D}$, this algorithm is fast but not very accurate, since $\mu^{D}$ is not determinant.

When a measure represents the distance of a feature set to the closest consistent feature set in some metric space, we call it distance-based. About distance-based measures, we only knew the two necessary conditions above, that is, determinacy and monotonicity, and hence, the distance-based measure was only an analogy used to explain the desirable nature of feature selection measures. This paper changes this. We present a way to identify a feature subset of a dataset uniquely with a point in a common metric space. The metric space is parameterized by a real number $p \in [1, \infty]$, and hence, we have infinitely many such metric spaces. In each metric space, the consistent feature sets form a closed subspace, and hence, we can determine the distance of an arbitrary feature set to the closest consistent feature set. This is nothing other than a distance-based measure. Each value of p determines a different metric function. For p = 1, we show that the measure is identical to the well-known Bayesian risk. On the other hand, for p > 1, the measure is novel and unknown in the literature. Also, through experiments, we show that measurements by our novel distance-based measures correlate with classification accuracy.
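As a concrete illustration of Examples 1 and 3, both the empirical Bayesian risk and the conditional entropy can be computed from a table of examples in a few lines. The following is a minimal Python sketch, not the authors' implementation; the function names and the list-of-tuples encoding of a dataset (class value stored last) are our own conventions, and the dataset used is the one of Example 4 below.

```python
import math
from collections import Counter

def bayes_risk(rows, feature_idx):
    """mu_br: sum over patterns x of p(S = x) - max_c p(S = x, C = c)."""
    n = len(rows)
    joint = Counter()
    for *values, label in rows:           # class label is stored last
        joint[(tuple(values[i] for i in feature_idx), label)] += 1
    best, total = Counter(), Counter()
    for (x, _), c in joint.items():
        total[x] += c
        best[x] = max(best[x], c)        # the majority class count per pattern
    return sum(total[x] - best[x] for x in total) / n

def conditional_entropy(rows, feature_idx):
    """mu_ce: -sum_{x,c} p(S = x, C = c) * log(p(S = x, C = c) / p(S = x))."""
    n = len(rows)
    joint, marg = Counter(), Counter()
    for *values, label in rows:
        x = tuple(values[i] for i in feature_idx)
        joint[(x, label)] += 1
        marg[x] += 1
    # (c/n) / (marg[x]/n) simplifies to c / marg[x]
    return -sum((c / n) * math.log(c / marg[x]) for (x, _), c in joint.items())

# Dataset D of Example 4: features (B, R, G), class last.
D = [("A", "M", "M", "p"), ("A", "M", "M", "n"), ("B", "N", "M", "p"),
     ("O", "C", "F", "n"), ("AB", "A", "M", "n")]
print(bayes_risk(D, [0, 1, 2]))   # 0.2, i.e. 1/5: the two (A, M, M) rows disagree
print(bayes_risk(D, [2]))         # 0.4, i.e. 2/5
```

Both functions return 0 exactly when every value pattern of the selected features carries a single class, i.e., when the feature set is consistent.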
2 Formulating feature selection measures

In this paper, we do not discriminate between features and random variables. Furthermore, for a dataset D, we use the following notation.

$F_D$ is the entire set of features of D. $F_D$ is finite and includes the feature $C_D$ that represents classes. Also, we denote $F_D \setminus \{C_D\}$ by $\tilde{F}_D$.

$\Omega_{F_D} = \prod_{X \in F_D} \Omega_X$ is the entire sample space, where $\Omega_X$ is the sample space of an individual feature X. For a feature set S, $\Omega_S$ means the Cartesian product $\prod_{X \in S} \Omega_X$, which is the sample space of S.

$p_D : \Omega_{F_D} \to \mathbb{Q}$ is the empirical probability distribution. We also identify D with the triplet $(F_D, \Omega_{F_D}, p_D)$.

For $S \subseteq F_D$, $D_{\downarrow S}$ is the dataset derived from D by eliminating the features in $F_D \setminus S$.

For $S \subseteq F_D$, $D_{\to S}$ is the dataset derived from D by replacing the features of S with a single feature $\bar{S}$ such that $\Omega_{\bar{S}} = \Omega_S$. The values for S of an example in D are replaced with the vector consisting of those values.

Example 4. For the dataset D determined by (a) below, we have $F_D = \{B, R, G, C_D\}$ and $\Omega_{F_D} = \{A, B, O, AB\} \times \{M, N, C, A\} \times \{M, F\} \times \{p, n\}$. When $S = \{R, G\}$, (b) and (c) determine $D_{\downarrow S}$ and $D_{\to S}$.

(a) D
B   R  G  C_D
A   M  M  p
A   M  M  n
B   N  M  p
O   C  F  n
AB  A  M  n

(b) $D_{\downarrow S}$
R  G  C_D
M  M  p
M  M  n
N  M  p
C  F  n
A  M  n

(c) $D_{\to S}$
B   $\bar{S}$   C_D
A   (M, M)  p
A   (M, M)  n
B   (N, M)  p
O   (C, F)  n
AB  (A, M)  n

On the other hand, a feature selection measure µ is formulated as a family of $\mu^D$ indexed by datasets D, and each $\mu^D$ is a real-valued function defined over the power set $\mathcal{P}(F_D)$ of $F_D$. In this paper, we assume that a feature selection measure satisfies the following three requirements, which all of the 9 measures surveyed in [Shin et al., 2011] satisfy as well.

Requirement 1 (Naming invariance). Measurements by the measure are invariant to renaming of features and feature values. This also implies that the measure treats every feature as categorical.

Requirement 2 (Localization invariance). If $T \subseteq S$, then $\mu^D(T) = \mu^{D_{\downarrow S}}(T)$ holds.
Requirement 3 (Feature aggregation invariance). If $S \subseteq T$, then $\mu^D(T) = \mu^{D_{\to S}}((T \setminus S) \cup \{\bar{S}\})$ holds.

Example 5. The Bayesian risk $\mu_{\mathrm{br}}$ has all of these invariance properties, where E is the dataset of Example 6: $\mu_{\mathrm{br}}^{D}(\tilde{F}_D) = \mu_{\mathrm{br}}^{E}(\tilde{F}_E) = 1/5$ indicates the naming invariance; for $S = \{R, G\}$, $\mu_{\mathrm{br}}^{D}(\{G\})$ and $\mu_{\mathrm{br}}^{D_{\downarrow S}}(\{G\})$ are both identical to $2/5$, and $\mu_{\mathrm{br}}^{D_{\to S}}(\{B, \bar{S}\}) = 1/5$ holds. These indicate the localization and feature aggregation invariance properties.

In this paper, we will identify pairs (D, S) and (E, T) of a dataset and a feature subset, if $\mu^D(S) = \mu^E(T)$ holds for every measure µ that satisfies Requirements 1, 2 and 3. For this purpose, we introduce the notion of coherent mappings and then prove Proposition 1.

Definition 1. A mapping $\phi : F_D \to F_E$ is coherent between D and E, if, and only if, there exist injections $v_Y : \Omega_{\phi^{-1}(Y)} \to \Omega_Y$ for $Y \in \mathrm{Im}(\phi)$ and an injection $v_{C_E} : \Omega_{C_D} \to \Omega_{C_E}$ such that

$p_D(x) = \sum_{y \,:\, y|_{\mathrm{Im}(\phi) \cup \{C_E\}} = v_\pi(x)} p_E(y),$

where $v_\pi = \prod_{Y \in \mathrm{Im}(\phi) \cup \{C_E\}} v_Y : \Omega_{F_D} \to \Omega_{\mathrm{Im}(\phi) \cup \{C_E\}}$.
Example 6. We consider the dataset D of Example 4 and a dataset E with features $F_1$, $F_2$, $F_3$ and class $C_E$. When we define ϕ by $\phi(B) = F_1$ and $\phi(R) = \phi(G) = F_2$, ϕ is coherent. To show this, we determine $v_{F_1}$ and $v_{F_2}$ as below and $v_{C_E}$ by $v_{C_E}(p) = 1$ and $v_{C_E}(n) = 2$. For example, $p_D(A, M, M, p) = 1/5$ and $p_E(1, 1, 1, 1) + p_E(1, 1, 3, 1) = 1/10 + 1/10 = 1/5$ hold.

$v_{F_1}$: A → 1, B → 2, O → 3, AB → 4
$v_{F_2}$: (M, M) → 1, (M, F) → 2, (N, M) → 3, (N, F) → 4, (C, M) → 5, (C, F) → 6, (A, M) → 7, (A, F) → 8

Proposition 1. If a feature selection measure µ satisfies Requirements 1 to 3, then $\mu^E(S) = \mu^D(\phi^{-1}(S))$ holds for an arbitrary coherent mapping $\phi : F_D \to F_E$ between D and E and an arbitrary $S \subseteq \mathrm{Im}(\phi)$.

Proof. We decompose ϕ as $\phi = \iota \circ \phi'$, where $\phi' : F_D \to \mathrm{Im}(\phi)$ and $\iota : \mathrm{Im}(\phi) \hookrightarrow F_E$ is the inclusion. Since ϕ′ yields a sequence of feature aggregations that convert D into $E_{\downarrow \mathrm{Im}(\phi)}$ up to renaming of features and feature values, $\mu^E(S) = \mu^{E_{\downarrow \mathrm{Im}(\phi)}}(S) = \mu^D(\phi^{-1}(S))$ follows from Requirements 1 to 3. ∎

Example 7. For the same D, E and ϕ as in Example 6, $\mu_{\mathrm{br}}^{E}(\{F_1, F_2\}) = \mu_{\mathrm{br}}^{D}(\{B, R, G\}) = \mu_{\mathrm{br}}^{E}(\{F_1\}) = \mu_{\mathrm{br}}^{D}(\{B\}) = \mu_{\mathrm{br}}^{E}(\{F_2\}) = \mu_{\mathrm{br}}^{D}(\{R, G\}) = 1/5$ holds.

3 Projecting a feature set of a dataset into a subspace $P_{\mathbb{R}}^{+}$ of the $\ell_p$ space ($p \in [1, \infty]$)

We let $n \in \mathbb{N} \cup \{\infty\}$. We introduce the space $P^n$ into which feature sets of datasets are projected. We start with some preparation. Let K be either the rational number field $\mathbb{Q}$ or the real number field $\mathbb{R}$. We define $P_K$ and $P_K^{+}$ as follows; they are sets of functions $p : \mathbb{N}^2 \to K$.

Definition 2. We define $P_K$ and $P_K^{+}$ by:

$P_K = \bigl\{ p \bigm| p(i, j) \ge 0,\ \textstyle\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} p(i, j) = 1 \bigr\}$;
$P_K^{+} = \bigl\{ p \bigm| p(i, j) \ge 0,\ \textstyle\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} p(i, j) \le 1 \bigr\}$.

We can identify $\mathbb{N}^2$ with $\mathbb{N}$ (for example, $f : (i, j) \mapsto \frac{(i+j-1)(i+j-2)}{2} + i$ is bijective), and therefore, $P_{\mathbb{R}}$ and $P_{\mathbb{R}}^{+}$ can be viewed as subspaces of $\ell_p$ for $1 \le p \le \infty$. If we focus on datasets that include at most n classes, we can use $P_K^{n}$ and $P_K^{+n}$ instead.

Definition 3. We define $P_K^{n}$ and $P_K^{+n}$ by:

$P_K^{n} = \{ p \in P_K \mid p(i, j) = 0 \text{ for all } j > n \}$;
$P_K^{+n} = \{ p \in P_K^{+} \mid p(i, j) = 0 \text{ for all } j > n \}$.

$P_{\mathbb{R}}^{n}$ and $P_{\mathbb{R}}^{+n}$ are metric spaces with the metric derived from the norm $\|\cdot\|_p$ of $\ell_p$.
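The bijection $f(i, j) = (i+j-1)(i+j-2)/2 + i$ mentioned after Definition 2 enumerates $\mathbb{N}^2$ along anti-diagonals. A quick sketch (ours, for illustration only) confirms that the pairs on the first anti-diagonals map onto an initial segment of $\mathbb{N}$ with no gaps or collisions:

```python
def f(i, j):
    # f(i, j) = (i+j-1)(i+j-2)/2 + i, a bijection N x N -> N (N starting at 1)
    return (i + j - 1) * (i + j - 2) // 2 + i

# The 190 pairs with i + j <= 20 map exactly onto 1..190.
values = sorted(f(i, j) for i in range(1, 20) for j in range(1, 20) if i + j <= 20)
print(values[:5], values[-1])  # [1, 2, 3, 4, 5] 190
```

Composing a point of $P_K$ with $f^{-1}$ is what lets a function on $\mathbb{N}^2$ be read as an $\ell_p$ sequence.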
$P_{\mathbb{R}}^{n}$ is closed in $\ell_p$ only when p = 1, while $P_{\mathbb{R}}^{+n}$ is the closure of $P_{\mathbb{R}}^{n}$ in $\ell_p$ for $1 < p \le \infty$. Therefore, $P_{\mathbb{R}}^{n}$ in $\ell_1$ and $P_{\mathbb{R}}^{+n}$ in $\ell_p$ for $1 < p \le \infty$ are complete, since $\ell_p$ is a Banach space. Although $P_{\mathbb{R}}^{n}$ and $P_{\mathbb{R}}^{+n}$ are bounded, neither of them is compact. We define $P^n$, a subspace of $P_{\mathbb{R}}^{n}$, as follows.

Definition 4. We let supp(p) be the smallest $\Omega_U \times \Omega_C$ such that $\Omega_U \subseteq \mathbb{N}$, $\Omega_C \subseteq \mathbb{N}$ and $\Omega_U \times \Omega_C \supseteq \mathrm{Supp}(p) = \{(i, j) \mid p(i, j) > 0\}$. We define P and $P^n$ for $n \in \mathbb{N}$ by:

$P = \{ p \in P_{\mathbb{Q}} \mid |\mathrm{supp}(p)| < \infty \}$;
$P^n = \{ p \in P \mid p(i, j) = 0 \text{ for all } j > n \}$.

Example 8. When we define p by $p(i, i) = (1 - r)r^{i-1}$ for $r \in (0, 1)$, $p \in P_{\mathbb{R}}$ holds, because $\sum_{i=1}^{\infty} (1 - r)r^{i-1} = (1 - r) \lim_{n \to \infty} \frac{1 - r^n}{1 - r} = 1$. Furthermore, if $r \in \mathbb{Q}$, $p \in P_{\mathbb{Q}}$ holds. For this p, we have $\Omega_U = \mathbb{N}$, $\Omega_C = \mathbb{N}$ and $\mathrm{Supp}(p) = \{(i, i) \mid i \in \mathbb{N}\}$. Hence, $p \notin P$.

For $n \in \mathbb{N} \cup \{\infty\}$, $P^n$ turns out to be dense in $P_{\mathbb{R}}^{n}$. For $p \in P^n$, $D_p = (\{U, C\}, \mathbb{N} \times \mathbb{N}, p)$ is not a dataset, because $\mathbb{N} \times \mathbb{N}$ is infinite. Nevertheless, we can think of a coherent mapping $\phi : \tilde{F}_D \to \{U\}$ between a dataset D and $D_p$. Theorem 1 shows how to project a feature set S of a dataset D into $P^n$.

Theorem 1. Let D be a dataset with n classes and $S \subseteq \tilde{F}_D$. There exists $p \in P^n$ such that $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p$. Conversely, for arbitrary $p \in P^n$, there exists a dataset D with n classes such that $\phi : \tilde{F}_D \to \{U\}$ is coherent between D and $D_p$.

Proof. Let $v_1 : \Omega_S \to \mathbb{N}$ and $v_2 : \Omega_{C_D} \to \{1, \ldots, n\}$ be injective. It suffices to define $p \in P^n$ by $p(i, j) = p_{D_{\downarrow S}}(v_1^{-1}(i), v_2^{-1}(j))$ if $(i, j) \in \mathrm{Im}(v_1 \times v_2)$ and $p(i, j) = 0$ otherwise. To show the converse, we have only to let $D = (\{U, C\}, \mathrm{supp}(p), p)$. ∎

Example 9. To project $D_{\downarrow S}$ of Example 4 into $P^2$, we determine $p \in P^2$ by $p(1, 1) = p(1, 2) = p(3, 1) = p(6, 2) = p(7, 2) = 1/5$ and $p(i, j) = 0$ otherwise. When we determine $v_U$ and $v_C$ to be identical to $v_{F_2}$ and $v_{C_E}$ of Example 6, $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p$, and hence, $D_{\downarrow S}$ is projected to p.

Example 10. A single $D_{\downarrow S}$ can be projected to infinitely many points in $P^n$. For example, we let $D_{\downarrow S}$, p and $v_U$ be the same as in Example 9.
For an arbitrary bijection $\pi : \mathbb{N} \to \mathbb{N}$, we define $p_\pi$ by $p_\pi(\pi(i), j) = p(i, j)$. Apparently, $D_{\downarrow S}$ projects to $p_\pi$ with $\pi \circ v_U$.
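Theorem 1's construction is essentially bookkeeping: number the observed value patterns of S and the classes injectively and carry the empirical probabilities over. A minimal sketch under our own encoding follows; here the numbering is first-come-first-served rather than the $v_{F_2}$ of Example 6, which, by Example 10, only yields another of the infinitely many projections of the same $D_{\downarrow S}$:

```python
from collections import Counter

def project(rows, feature_idx):
    """Project D (class value last in each row) restricted to S = feature_idx
    to a finitely supported point p: number patterns and classes as they appear."""
    n = len(rows)
    v_u, v_c, p = {}, {}, Counter()
    for *values, label in rows:
        i = v_u.setdefault(tuple(values[k] for k in feature_idx), len(v_u) + 1)
        j = v_c.setdefault(label, len(v_c) + 1)
        p[(i, j)] += 1 / n
    return dict(p)

D = [("A", "M", "M", "p"), ("A", "M", "M", "n"), ("B", "N", "M", "p"),
     ("O", "C", "F", "n"), ("AB", "A", "M", "n")]
point = project(D, [1, 2])   # S = {R, G}
# Five cells of mass 1/5 each; pattern (M, M) becomes row 1 and meets both classes.
```

Since the class axis receives at most two indices here, the resulting point lies in $P^2$, as Theorem 1 promises for a two-class dataset.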
Example 11. More than one dataset can be projected to a single point in $P^n$. For example, $D_{\downarrow S}$ of Example 4 and $E_{\downarrow \{F_2\}}$ of Example 6 are both projected to any of the points $p_\pi$ of Example 10. To be precise, every point in $P^n$ has infinitely many datasets that project to it.

Based on Proposition 1 and Theorem 1, Theorem 2 asserts that a feature selection measure can be uniquely viewed as a real-valued function defined over $P^n$.

Theorem 2. For a feature selection measure µ, there uniquely exists $\hat{\mu} : P \to \mathbb{R}$ such that $\hat{\mu}(p) = \mu^D(S)$ holds for arbitrary D, S and p such that $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p$.

Proof. We let $D_p^{\circ}$ be $(\{U, C\}, \mathrm{supp}(p), p)$ for $p \in P$, and we determine $\hat{\mu}(p)$ by $\mu^{D_p^{\circ}}(\{U\})$. For a dataset D such that $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p$ with $v_U : \Omega_S \to \mathbb{N}$ and $v_C : \Omega_{C_D} \to \mathbb{N}$, we let $D_p^{\#} = (\{U, C\}, \mathrm{Im}(v_U) \times \mathrm{Im}(v_C), p)$. $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p^{\#}$, and the inclusion $\mathrm{supp}(p) \subseteq \mathrm{Im}(v_U) \times \mathrm{Im}(v_C)$ yields coherence between $D_p^{\circ}$ and $D_p^{\#}$. Therefore, $\mu^D(S) = \mu^{D_p^{\#}}(\{U\}) = \mu^{D_p^{\circ}}(\{U\}) = \hat{\mu}(p)$. ∎

Example 12. Datasets that are projected to the same point in $P^n$ have the same measurement under an arbitrary feature selection measure µ. For example, as seen in Example 11, $D_{\downarrow S}$ of Example 4 and $E_{\downarrow \{F_2\}}$ of Example 6 are both projected to the same point of $P^2$, and $\mu_{\mathrm{br}}^{D}(S) = \mu_{\mathrm{br}}^{E}(\{F_2\}) = 1/5$ holds as seen in Example 7. Therefore, by letting $\hat{\mu}(p) = \mu^D(S)$ for $D_{\downarrow S}$ that projects to p, $\hat{\mu}$ is well defined.

The problem with the projection determined by Theorem 1 is that a single pair (D, S) is mapped to an infinite number of different points in $P^n$ (Example 10). In the next section, we solve this problem.

4 Defining a quotient space $Q_{\mathbb{R}}^{+}$ of $P_{\mathbb{R}}^{+}$

Our idea for solving the problem is to introduce an equivalence relation that unifies the points of $P^n$ that are images of the same pair (D, S), and then to use the resulting quotient space. The relation defined below is indeed an equivalence relation. We assume $n \in \mathbb{N} \cup \{\infty\}$.

Definition 5. Let p and q be in $P_{\mathbb{R}}^{+}$.
We define $p \sim q$, if, and only if, there exist bijections $v_U : \mathbb{N} \to \mathbb{N}$ and $v_C : \mathbb{N} \to \mathbb{N}$ such that $p(i, j) = q(v_U(i), v_C(j))$.

Example 13. $p_\pi \sim p$ holds for the $p_\pi$ and π of Example 10.

Definition 6. We let $Q_{\mathbb{R}}^{+} = P_{\mathbb{R}}^{+}/{\sim}$. Moreover, for the canonical projection $\pi : P_{\mathbb{R}}^{+} \to Q_{\mathbb{R}}^{+}$, we let $Q_{\mathbb{R}}^{+n} = \pi(P_{\mathbb{R}}^{+n})$, $Q_{\mathbb{R}}^{n} = \pi(P_{\mathbb{R}}^{n})$ and $Q^n = \pi(P^n)$.

Proposition 2 is easy to see but plays a crucial role in proving Theorems 3 and 4.

Proposition 2. For p and q in P, the following are equivalent.
1. $p \sim q$.
2. There exists (D, S) such that $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p$ and between $D_{\downarrow S}$ and $D_q$.
3. $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p$, if, and only if, it is coherent between $D_{\downarrow S}$ and $D_q$.

Theorems 3 and 4 follow from Theorems 1 and 2 and Proposition 2. In the following, [p] denotes π(p), and [D, S] denotes the image of (D, S) in Q.

Theorem 3. Let D be a dataset with n classes and $S \subseteq \tilde{F}_D$. There uniquely exists $[p] \in Q^n$ such that $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p$. Conversely, for arbitrary $[p] \in Q^n$, there exists a dataset D with n classes such that $\phi : \tilde{F}_D \to \{U\}$ is coherent between D and $D_p$.

Theorem 4. For a feature selection measure µ, there uniquely exists $[\hat{\mu}] : Q \to \mathbb{R}$ such that $[\hat{\mu}]([D, S]) = \mu^D(S)$ holds for any dataset D and $S \subseteq \tilde{F}_D$.

5 Introducing a metric $d_p$ into $Q_{\mathbb{R}}^{+}$

Although a quotient of a metric space is not always a metric space, we can derive a metric on $Q_{\mathbb{R}}^{+}$ from $\ell_p$ as follows.

Definition 7. For [p] and [q] in $Q_{\mathbb{R}}^{+}$, we define $d_p([p], [q]) = \inf\{ \|p' - q'\|_p \mid p' \sim p,\ q' \sim q \}$.

Theorem 5. For $1 \le p \le \infty$, $d_p$ is a metric over $Q_{\mathbb{R}}^{+}$.

Proof. We only prove the triangle inequality $d_p([p], [r]) \le d_p([p], [q]) + d_p([q], [r])$, taking advantage of Lemma 1.

Lemma 1. For $p' \sim p$ and q in $P_{\mathbb{R}}^{+}$, there exists $q' \sim q$ such that $\|p - q'\|_p = \|p' - q\|_p$.

For $\varepsilon > 0$, we take $q' \sim q$ such that $\|p - q'\|_p - d_p([p], [q]) < \varepsilon/2$ and $r' \sim r$ such that $\|q - r'\|_p - d_p([q], [r]) < \varepsilon/2$. By Lemma 1, there exists $r'' \sim r'$ such that $\|q' - r''\|_p = \|q - r'\|_p$, and hence, $d_p([p], [r]) \le \|p - r''\|_p \le \|p - q'\|_p + \|q' - r''\|_p < d_p([p], [q]) + d_p([q], [r]) + \varepsilon$. Since ε is arbitrary, the triangle inequality follows. ∎
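For points with small finite support, the infimum in Definition 7 can be evaluated by brute force over relabelings of rows and classes. The sketch below is our own illustration (exponential in the support size, so only for toy examples), not an algorithm from the paper:

```python
from itertools import permutations

def quotient_distance(p, q, p_norm, size):
    """d_p([p], [q]) for points supported inside {1..size} x {1..size}:
    minimise ||p' - q||_p over all relabelings p' ~ p (Definition 7)."""
    idx = range(1, size + 1)
    best = float("inf")
    for ru in permutations(idx):            # v_U: relabel the rows
        for rc in permutations(idx):        # v_C: relabel the classes
            diff = [abs(p.get((ru[i - 1], rc[j - 1]), 0.0) - q.get((i, j), 0.0))
                    for i in idx for j in idx]
            d = max(diff) if p_norm == float("inf") else \
                sum(x ** p_norm for x in diff) ** (1 / p_norm)
            best = min(best, d)
    return best

p = {(1, 1): 0.5, (2, 2): 0.5}
q = {(1, 2): 0.5, (2, 1): 0.5}      # p up to a swap of the class labels
print(quotient_distance(p, q, 1, 2))   # 0.0, since [p] = [q]
```

Because p and q here differ only by relabeling, the distance between their classes is zero even though $\|p - q\|_1 = 2$ for these particular representatives.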
6 Deriving a measure $\mu_{\ell^p}$ from $d_p$

We will see that the minimum $d_p$ distance from [D, S] to a consistent point determines a feature selection measure. We first define the consistent points of $P_{\mathbb{R}}^{+}$.

Definition 8. $p \in P_{\mathbb{R}}^{+}$ is consistent, if, and only if, there exists $\bar{j} : \mathbb{N} \to \mathbb{N}$ such that $p(i, j) = 0$ if $j \ne \bar{j}(i)$.
$C_{\mathbb{R}}^{+}$ denotes the entire set of consistent points, and we let $C_{\mathbb{R}}^{+n} = C_{\mathbb{R}}^{+} \cap P_{\mathbb{R}}^{+n}$, $C_{\mathbb{R}}^{n} = C_{\mathbb{R}}^{+} \cap P_{\mathbb{R}}^{n}$ and $C^n = C_{\mathbb{R}}^{+} \cap P^n$ for $n \in \mathbb{N} \cup \{\infty\}$.

Definition 9. For $\bar{j} : \mathbb{N} \to \mathbb{N}$ such that $\bar{j}(i) \in \mathrm{argmax}\{p(i, j) \mid j \in \mathbb{N}\}$ and $p \in P$, we define:

For $1 \le p < \infty$,
$\mu_{\ell^p}(p) = \Bigl( \sum_{i \in \mathbb{N}} \sum_{j \in \mathbb{N}} p(i, j)^p - \sum_{i \in \mathbb{N}} p(i, \bar{j}(i))^p \Bigr)^{1/p}$;

For $p = \infty$,
$\mu_{\ell^\infty}(p) = \max\{ p(i, j) \mid (i, j) \in \mathbb{N}^2,\ (i, j) \ne (i, \bar{j}(i)) \}$.

Surprisingly, $\mu_{\ell^1}$ is identical to the Bayesian risk.

Proposition 3. If $\phi : S \to \{U\}$ is coherent between $D_{\downarrow S}$ and $D_p$, then $\mu_{\mathrm{br}}^{D}(S) = \mu_{\ell^1}(p)$ holds.

Proof. We assume that $v_U : \Omega_S \to \mathbb{N}$ and $v_C : \Omega_{C_D} \to \mathbb{N}$ yield the coherence between $D_{\downarrow S}$ and $D_p$. Then

$\mu_{\mathrm{br}}^{D}(S) = \sum_{x \in \Omega_S} \Bigl( p_{D_{\downarrow S}}(S = x) - \max_{c} p_{D_{\downarrow S}}(S = x, C = c) \Bigr) = \sum_{x \in \Omega_S} \Bigl( \sum_{c} p(v_U(x), v_C(c)) - \max_{c} p(v_U(x), v_C(c)) \Bigr) = \mu_{\ell^1}(p).$ ∎

Theorem 6. For $p \in P$, the following hold.
1. $\mu_{\ell^1}(p) = \frac{1}{2} \min\{ d_1([p], [q]) \mid q \in C \}$.
2. For $1 < p \le \infty$, $\mu_{\ell^p}(p) = \inf\{ d_p([p], [q]) \mid q \in C \} = \min\{ d_p([p], [q]) \mid q \in C_{\mathbb{R}}^{+},\ |\mathrm{supp}(q)| < \infty \}$.

Proof. Here, we prove only 1. For $q \in C$, we take $j' : \mathbb{N} \to \mathbb{N}$ such that $q(i, j) = 0$ if $j \ne j'(i)$. Also, we let $\bar{j} : \mathbb{N} \to \mathbb{N}$ satisfy $\bar{j}(i) \in \mathrm{argmax}\{p(i, j) \mid j \in \mathbb{N}\}$. We first fix $i \in \mathbb{N}$:

$\sum_{j} |p(i, j) - q(i, j)| = |p(i, j'(i)) - q(i, j'(i))| + \sum_{j \ne j'(i)} p(i, j) \ge q(i, j'(i)) - \max_{j} p(i, j) + \sum_{j} p(i, j) - \max_{j} p(i, j).$

Then, we sum up both sides across all $i \in \mathbb{N}$. Since $\sum_i q(i, j'(i)) = 1$, $\sum_{i, j} p(i, j) = 1$ and $\sum_i \max_j p(i, j) = 1 - \mu_{\ell^1}(p)$, we obtain $\|p - q\|_1 \ge 2\mu_{\ell^1}(p)$. On the other hand, for $q \in C$ such that $q(i, j) = p(i, j)/(1 - \mu_{\ell^1}(p))$ if $j = \bar{j}(i)$ and $q(i, j) = 0$ otherwise, we have $\|p - q\|_1 = 2\bigl(1 - \sum_i \max_j p(i, j)\bigr) = 2\mu_{\ell^1}(p)$. As $q' \in C$ whenever $q' \sim q$ for $q \in C$, Lemma 1 implies the assertion. ∎

Determinacy of $\mu_{\ell^p}$ immediately follows from Theorem 6. Also, it is easy to show the monotonicity of $\mu_{\ell^p}$. Furthermore, $\mu_{\ell^p}$ turns out to be continuous under $d_q$ for arbitrary $p, q \in [1, \infty]$.

7 Contrasting $d_1$ and $d_p$ with p > 1

Proposition 3 asserts that $\mu_{\ell^1}$ is identical to the Bayesian risk, which is well known in the literature. On the other hand, $\mu_{\ell^p}$ with $p \in (1, \infty]$ is a novel measure, unknown in the literature. Also, as metric functions, $d_1$ and $d_p$ with $p \in (1, \infty]$ are significantly different, as stated below without proof.
$Q_{\mathbb{R}}$ is complete under $d_1$, while $Q_{\mathbb{R}}^{+}$ is complete under $d_p$ for $p \in (1, \infty]$. $Q_{\mathbb{R}}^{n}$ is not compact under $d_1$, while $Q_{\mathbb{R}}^{+n}$ is compact under $d_p$ for $p \in (1, \infty]$. $Q^n$ is dense in $Q_{\mathbb{R}}^{n}$ under $d_1$, while it is dense in $Q_{\mathbb{R}}^{+n}$ under $d_p$ for $p \in (1, \infty]$.

Moreover, in Section 8, we will see that $\mu_{\ell^1}$ and $\mu_{\ell^p}$ for $p \in (1, \infty]$ have different properties in terms of the correlation between their measurements and classification accuracy. Although we cannot explain the reason for this difference, we suspect that it is related to the aforementioned differences between $d_1$ and $d_p$ with $p \in (1, \infty]$ as metric functions.

8 Correlation between measurements by $\mu_{\ell^p}$ and classification accuracy

We naturally expect strong correlation between measurements by a good measure and classification accuracy. From this point of view, we ran experiments with $\mu_{\ell^p}$ for $p = 1, 2, \ldots, 5$. In particular, we focus on the comparison between p = 1 and p > 1: $\mu_{\ell^1}$ is identical to the Bayesian risk and is the feature selection measure used the most widely in the literature, while the other measures are novel and introduced in this paper for the first time.
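On a projected point, the measures $\mu_{\ell^p}$ of Definition 9 are cheap to evaluate: they are the $\ell_p$ norm of what remains after removing one maximal entry from each row. A sketch under our dict-of-cells encoding (ours, not the authors' code):

```python
from collections import defaultdict

def mu_lp(point, p_norm):
    """mu_{l^p}(point) per Definition 9: drop one maximal entry per row i,
    then take the l_p norm of the rest (for p = inf, the largest survivor)."""
    rows = defaultdict(list)
    for (i, _), v in point.items():
        rows[i].append(v)
    leftovers = []
    for vals in rows.values():
        vals.sort(reverse=True)
        leftovers.extend(vals[1:])          # everything but one argmax cell
    if p_norm == float("inf"):
        return max(leftovers, default=0.0)
    return sum(v ** p_norm for v in leftovers) ** (1 / p_norm)

# The point of Example 9: only row 1 keeps a leftover cell of mass 1/5,
# so every mu_{l^p} equals 0.2 here, and mu_{l^1} equals the Bayesian risk.
point = {(1, 1): 0.2, (1, 2): 0.2, (3, 1): 0.2, (6, 2): 0.2, (7, 2): 0.2}
print([mu_lp(point, p) for p in (1, 2, 3, float("inf"))])
```

The value 0 is returned exactly when each row carries at most one positive cell, i.e., when the point is consistent, which is the determinacy guaranteed by Theorem 6.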
Table 1: Datasets

NAME            #FEAT.  #EXAM.  #CLASSES
ARRHYTHMIA
AUDIOLOGY
MFEAT-FACTOR     216     2000     10
MFEAT-FOURIER     76     2000     10
MFEAT-KARHUNEN    64     2000     10
MFEAT-PIXEL      240     2000     10
MFEAT-ZERNIKE     47     2000     10
MUSK
OPTDIGITS
SONAR
SPAMBASE
SPECTROMETER

8.1 Datasets

Table 1 shows the datasets that we use in our experiments, together with their important attributes, namely the numbers of features, examples and classes. All of the datasets are obtained from the UCI repository of machine learning databases [Blake and Merz, 1998].

8.2 Methods

The following are the steps of our experiments.

Sampling feature sets. For each dataset of Table 1, 60 feature sets are selected at random. The sizes of the selected feature sets also vary at random. In total, we obtain 60 × 12 = 720 pairs of a feature set and a dataset.

Localizing datasets. For each pair of a feature set and a dataset, we localize the dataset by eliminating all of the features that do not belong to the relevant feature set. Consequently, we obtain 720 localized datasets.

Measuring classification accuracy. We apply three classifiers, namely Naïve Bayes, C4.5 and SVM, to each of the localized datasets in the manner of 10-fold cross validation and record the averaged AUC-ROC scores, which we use as classification accuracy scores.

8.3 Results

Figures 1, 2 and 3 display scatter plots of the results of our experiments with Naïve Bayes, C4.5 and SVM, respectively. The x-axis of each chart represents measurements by one of the measures (a) $\mu_{\ell^1}$, (b) $\mu_{\ell^2}$, (c) $\mu_{\ell^3}$, (d) $\mu_{\ell^4}$ and (e) $\mu_{\ell^5}$, while the y-axis represents the averaged AUC-ROC scores. From the charts, we have the impression that $\mu_{\ell^p}$ with p > 1 shows stronger negative correlation with classification accuracy than $\mu_{\ell^1}$. We look into this more closely in the next subsection.

8.4 Analysis of correlation

To compare $\mu_{\ell^1}$ and $\mu_{\ell^p}$ with p > 1 in terms of correlation with classification accuracy, we introduce a function $P_t(x)$ as an index. We let $N_x$ be the number of plots whose $\mu_{\ell^p}$ distances fall within $[x, x + 0.1)$
and $N_{t,x}$ be the number of plots whose AUC-ROC scores exceed t and whose $\mu_{\ell^p}$ distances fall within $[x, x + 0.1)$. Then, we define $P_t(x)$ by $P_t(x) = N_{t,x}/N_x$. Intuitively, $P_t(x)$ approximates the probability that the classification accuracy exceeds t when the measurement by the relevant measure is x.

Figures 4, 5 and 6 show the curves of $P_t(x)$ for Naïve Bayes, C4.5 and SVM. The value of t varies in $\{0.95, 0.9, 0.85, 0.8\}$. Since $P_{t'}(x) \ge P_t(x)$ holds for $t' < t$, the curve for $t'$ is located above the curve for t. We observe two properties from the charts. First, the curves for $\mu_{\ell^p}$ with p > 1 are akin to one another in shape, while the curves for $\mu_{\ell^1}$ appear significantly different from those of the other measures. Second, the curves for $\mu_{\ell^p}$ with p > 1 exhibit clearer negative correlation between measurements by the measure and classification accuracy than the curves for $\mu_{\ell^1}$. The correlation coefficients between the measurements x and the values of $P_t(x)$, presented per measure and classifier, also support this observation.

[Table: correlation coefficients between x and $P_t(x)$ for $\mu_{\ell^1}$ through $\mu_{\ell^5}$ under Naïve Bayes, C4.5 and SVM, for $t \in \{0.95, 0.9, 0.85, 0.8\}$.]

9 Conclusion

We have introduced a family of feature selection measures $\{\mu_{\ell^p} \mid p \in [1, \infty]\}$ that are derived from the distance functions of certain metric spaces. Although $\mu_{\ell^1}$ turns out to be identical to the Bayesian risk, $\mu_{\ell^p}$ for $p \in (1, \infty]$ is novel. Through experiments, we have shown that $\mu_{\ell^p}$ with p > 1 exhibits clearer negative correlation between its measurements and classification accuracy than $\mu_{\ell^1}$ does.
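The index $P_t(x)$ of Section 8.4 is a per-bin precision, computable in one pass over the (distance, AUC-ROC) pairs. A sketch of that computation follows; the pairs below are made-up illustrations, not the paper's experimental data:

```python
def p_t_curve(points, t, width=0.1):
    """P_t(x) = N_{t,x} / N_x over bins [x, x + width) of the distance axis."""
    bins = {}
    for dist, auc in points:
        x = round(int(dist / width) * width, 10)   # left edge of the bin
        n, n_t = bins.get(x, (0, 0))
        bins[x] = (n + 1, n_t + (auc > t))         # count, and count above t
    return {x: n_t / n for x, (n, n_t) in sorted(bins.items())}

pts = [(0.05, 0.96), (0.07, 0.80), (0.15, 0.99), (0.18, 0.50)]
print(p_t_curve(pts, 0.9))   # {0.0: 0.5, 0.1: 0.5}
```

A clearly negative correlation between a measure and accuracy shows up as $P_t(x)$ curves that decrease in x, which is what the paper reports for $\mu_{\ell^p}$ with p > 1.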
Figure 1: Scatter plots of the experimental results (Naïve Bayes); panels (a)-(e) show $\mu_{\ell^1}$ through $\mu_{\ell^5}$, with the measure's value on the x-axis and the averaged AUC-ROC on the y-axis.
Figure 2: Scatter plots of the experimental results (C4.5).
Figure 3: Scatter plots of the experimental results (SVM).
Figure 4: The curves of $P_t(x)$ for $t \in \{0.95, 0.9, 0.85, 0.8\}$ (Naïve Bayes).
Figure 5: The curves of $P_t(x)$ for $t \in \{0.95, 0.9, 0.85, 0.8\}$ (C4.5).
Figure 6: The curves of $P_t(x)$ for $t \in \{0.95, 0.9, 0.85, 0.8\}$ (SVM).
References

[Almuallim and Dietterich, 1994] H. Almuallim and T. G. Dietterich. Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1-2), 1994.

[Arauzo-Azofra et al., 2008] A. Arauzo-Azofra, J. M. Benitez, and J. L. Castro. Consistency measures for feature selection. Journal of Intelligent Information Systems, 30(3), November 2008.

[Blake and Merz, 1998] C. S. Blake and C. J. Merz. UCI repository of machine learning databases. Technical report, University of California, Irvine, 1998.

[Bolón-Canedo et al., 2011] V. Bolón-Canedo, S. Seth, and N. Sánchez-Maroño. Statistical dependence measure for feature selection in microarray datasets. In ESANN 2011, pages 23-28, 2011.

[Foithong et al., 2012] S. Foithong, O. Pinngern, and B. Attachoo. Feature subset selection wrapper based on mutual information and rough sets. Expert Systems with Applications, 39(1), 2012.

[Liu et al., 1998] H. Liu, H. Motoda, and M. Dash. A monotonic measure for optimal feature selection. In Proceedings of the European Conference on Machine Learning, 1998.

[Mengle and Goharian, 2009] S. S. R. Mengle and N. Goharian. Ambiguity measure feature-selection algorithm. Journal of the American Society for Information Science and Technology, 60(5), 2009.

[Molina et al., 2002] L. Molina, L. Belanche, and A. Nebot. Feature selection algorithms: A survey and experimental evaluation. In Proceedings of the IEEE International Conference on Data Mining, pages 306-313, 2002.

[Pawlak, 1991] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.

[Peng et al., 2005] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), August 2005.

[Shin and Xu, 2009] K. Shin and X. M. Xu. Consistency-based feature selection. In 13th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, 2009.
[Shin et al., 2011] K. Shin, D. Fernandes, and S. Miyazaki. Consistency measures for feature selection: A formal definition, relative sensitivity comparison, and a fast algorithm. In 22nd International Joint Conference on Artificial Intelligence, 2011.

[Suzuki and Sugiyama, 2013] T. Suzuki and M. Sugiyama. Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25(3), 2013.

[Wang, 2015] L. Wang. Feature selection algorithm based on conditional dynamic mutual information. International Journal on Smart Sensing and Intelligent Systems, 8(1), 2015.

[Zhao and Liu, 2007] Z. Zhao and H. Liu. Searching for interacting features. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1156-1161, 2007.
More informationWhen is undersampling effective in unbalanced classification tasks?
When is undersampling effective in unbalanced classification tasks? Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi 09/09/2015 ECML-PKDD 2015 Porto, Portugal 1/ 23 INTRODUCTION In several binary
More information5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing.
5 Measure theory II 1. Charges (signed measures). Let (Ω, A) be a σ -algebra. A map φ: A R is called a charge, (or signed measure or σ -additive set function) if φ = φ(a j ) (5.1) A j for any disjoint
More informationMETRIC BASED ATTRIBUTE REDUCTION IN DYNAMIC DECISION TABLES
Annales Univ. Sci. Budapest., Sect. Comp. 42 2014 157 172 METRIC BASED ATTRIBUTE REDUCTION IN DYNAMIC DECISION TABLES János Demetrovics Budapest, Hungary Vu Duc Thi Ha Noi, Viet Nam Nguyen Long Giang Ha
More informationSUPPORT VECTOR MACHINE
SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition
More informationPOINT SET TOPOLOGY. Definition 2 A set with a topological structure is a topological space (X, O)
POINT SET TOPOLOGY Definition 1 A topological structure on a set X is a family O P(X) called open sets and satisfying (O 1 ) O is closed for arbitrary unions (O 2 ) O is closed for finite intersections.
More informationAfter taking the square and expanding, we get x + y 2 = (x + y) (x + y) = x 2 + 2x y + y 2, inequality in analysis, we obtain.
Lecture 1: August 25 Introduction. Topology grew out of certain questions in geometry and analysis about 100 years ago. As Wikipedia puts it, the motivating insight behind topology is that some geometric
More informationDYNAMICAL CUBES AND A CRITERIA FOR SYSTEMS HAVING PRODUCT EXTENSIONS
DYNAMICAL CUBES AND A CRITERIA FOR SYSTEMS HAVING PRODUCT EXTENSIONS SEBASTIÁN DONOSO AND WENBO SUN Abstract. For minimal Z 2 -topological dynamical systems, we introduce a cube structure and a variation
More informationII - REAL ANALYSIS. This property gives us a way to extend the notion of content to finite unions of rectangles: we define
1 Measures 1.1 Jordan content in R N II - REAL ANALYSIS Let I be an interval in R. Then its 1-content is defined as c 1 (I) := b a if I is bounded with endpoints a, b. If I is unbounded, we define c 1
More informationMinimal Attribute Space Bias for Attribute Reduction
Minimal Attribute Space Bias for Attribute Reduction Fan Min, Xianghui Du, Hang Qiu, and Qihe Liu School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu
More informationCombinatorics in Banach space theory Lecture 12
Combinatorics in Banach space theory Lecture The next lemma considerably strengthens the assertion of Lemma.6(b). Lemma.9. For every Banach space X and any n N, either all the numbers n b n (X), c n (X)
More informationDavid Hilbert was old and partly deaf in the nineteen thirties. Yet being a diligent
Chapter 5 ddddd dddddd dddddddd ddddddd dddddddd ddddddd Hilbert Space The Euclidean norm is special among all norms defined in R n for being induced by the Euclidean inner product (the dot product). A
More informationGroup Decision-Making with Incomplete Fuzzy Linguistic Preference Relations
Group Decision-Making with Incomplete Fuzzy Linguistic Preference Relations S. Alonso Department of Software Engineering University of Granada, 18071, Granada, Spain; salonso@decsai.ugr.es, F.J. Cabrerizo
More informationTopological vectorspaces
(July 25, 2011) Topological vectorspaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ Natural non-fréchet spaces Topological vector spaces Quotients and linear maps More topological
More informationGenerative MaxEnt Learning for Multiclass Classification
Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,
More informationProblem Set 2: Solutions Math 201A: Fall 2016
Problem Set 2: s Math 201A: Fall 2016 Problem 1. (a) Prove that a closed subset of a complete metric space is complete. (b) Prove that a closed subset of a compact metric space is compact. (c) Prove that
More informationHilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality
(October 29, 2016) Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/notes 2016-17/03 hsp.pdf] Hilbert spaces are
More informationA New Wrapper Method for Feature Subset Selection
A New Wrapper Method for Feature Subset Selection Noelia Sánchez-Maroño 1 and Amparo Alonso-Betanzos 1 and Enrique Castillo 2 1- University of A Coruña- Department of Computer Science- LIDIA Lab. Faculty
More informationBanach Spaces II: Elementary Banach Space Theory
BS II c Gabriel Nagy Banach Spaces II: Elementary Banach Space Theory Notes from the Functional Analysis Course (Fall 07 - Spring 08) In this section we introduce Banach spaces and examine some of their
More informationChapter 6: Classification
Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant
More informationMA651 Topology. Lecture 9. Compactness 2.
MA651 Topology. Lecture 9. Compactness 2. This text is based on the following books: Topology by James Dugundgji Fundamental concepts of topology by Peter O Neil Elements of Mathematics: General Topology
More informationLebesgue Measure on R n
8 CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationTopological properties of Z p and Q p and Euclidean models
Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete
More informationLebesgue Measure on R n
CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets
More informationEasy Categorization of Attributes in Decision Tables Based on Basic Binary Discernibility Matrix
Easy Categorization of Attributes in Decision Tables Based on Basic Binary Discernibility Matrix Manuel S. Lazo-Cortés 1, José Francisco Martínez-Trinidad 1, Jesús Ariel Carrasco-Ochoa 1, and Guillermo
More information4 Choice axioms and Baire category theorem
Tel Aviv University, 2013 Measure and category 30 4 Choice axioms and Baire category theorem 4a Vitali set....................... 30 4b No choice....................... 31 4c Dependent choice..................
More informationFinal Examination CS540-2: Introduction to Artificial Intelligence
Final Examination CS540-2: Introduction to Artificial Intelligence May 9, 2018 LAST NAME: SOLUTIONS FIRST NAME: Directions 1. This exam contains 33 questions worth a total of 100 points 2. Fill in your
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationAnalysis Finite and Infinite Sets The Real Numbers The Cantor Set
Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered
More informationFunctional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...
Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................
More informationStatistical Theory 1
Statistical Theory 1 Set Theory and Probability Paolo Bautista September 12, 2017 Set Theory We start by defining terms in Set Theory which will be used in the following sections. Definition 1 A set is
More informationChapter 2 Metric Spaces
Chapter 2 Metric Spaces The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics
More informationI teach myself... Hilbert spaces
I teach myself... Hilbert spaces by F.J.Sayas, for MATH 806 November 4, 2015 This document will be growing with the semester. Every in red is for you to justify. Even if we start with the basic definition
More informationThe Completion of a Metric Space
The Completion of a Metric Space Let (X, d) be a metric space. The goal of these notes is to construct a complete metric space which contains X as a subspace and which is the smallest space with respect
More informationRegret Analysis for Performance Metrics in Multi-Label Classification The Case of Hamming and Subset Zero-One Loss
Regret Analysis for Performance Metrics in Multi-Label Classification The Case of Hamming and Subset Zero-One Loss Krzysztof Dembczyński 1, Willem Waegeman 2, Weiwei Cheng 1, and Eyke Hüllermeier 1 1 Knowledge
More informationComparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees
Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland
More information1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3
Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationAN EXPLORATION OF THE METRIZABILITY OF TOPOLOGICAL SPACES
AN EXPLORATION OF THE METRIZABILITY OF TOPOLOGICAL SPACES DUSTIN HEDMARK Abstract. A study of the conditions under which a topological space is metrizable, concluding with a proof of the Nagata Smirnov
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationA Logical Formulation of the Granular Data Model
2008 IEEE International Conference on Data Mining Workshops A Logical Formulation of the Granular Data Model Tuan-Fang Fan Department of Computer Science and Information Engineering National Penghu University
More informationDreem Challenge report (team Bussanati)
Wavelet course, MVA 04-05 Simon Bussy, simon.bussy@gmail.com Antoine Recanati, arecanat@ens-cachan.fr Dreem Challenge report (team Bussanati) Description and specifics of the challenge We worked on the
More informationStat 451: Solutions to Assignment #1
Stat 451: Solutions to Assignment #1 2.1) By definition, 2 Ω is the set of all subsets of Ω. Therefore, to show that 2 Ω is a σ-algebra we must show that the conditions of the definition σ-algebra are
More informationBoolean Algebras, Boolean Rings and Stone s Representation Theorem
Boolean Algebras, Boolean Rings and Stone s Representation Theorem Hongtaek Jung December 27, 2017 Abstract This is a part of a supplementary note for a Logic and Set Theory course. The main goal is to
More informationFirst In-Class Exam Solutions Math 410, Professor David Levermore Monday, 1 October 2018
First In-Class Exam Solutions Math 40, Professor David Levermore Monday, October 208. [0] Let {b k } k N be a sequence in R and let A be a subset of R. Write the negations of the following assertions.
More informationPrinciples of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata
Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision
More informationThe small ball property in Banach spaces (quantitative results)
The small ball property in Banach spaces (quantitative results) Ehrhard Behrends Abstract A metric space (M, d) is said to have the small ball property (sbp) if for every ε 0 > 0 there exists a sequence
More informationNotions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy
Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.
More informationMeasures. Chapter Some prerequisites. 1.2 Introduction
Lecture notes Course Analysis for PhD students Uppsala University, Spring 2018 Rostyslav Kozhan Chapter 1 Measures 1.1 Some prerequisites I will follow closely the textbook Real analysis: Modern Techniques
More informationAn Approach to Classification Based on Fuzzy Association Rules
An Approach to Classification Based on Fuzzy Association Rules Zuoliang Chen, Guoqing Chen School of Economics and Management, Tsinghua University, Beijing 100084, P. R. China Abstract Classification based
More informationTopology. Xiaolong Han. Department of Mathematics, California State University, Northridge, CA 91330, USA address:
Topology Xiaolong Han Department of Mathematics, California State University, Northridge, CA 91330, USA E-mail address: Xiaolong.Han@csun.edu Remark. You are entitled to a reward of 1 point toward a homework
More informationFeature selection and extraction Spectral domain quality estimation Alternatives
Feature selection and extraction Error estimation Maa-57.3210 Data Classification and Modelling in Remote Sensing Markus Törmä markus.torma@tkk.fi Measurements Preprocessing: Remove random and systematic
More informationROUGH SETS THEORY AND DATA REDUCTION IN INFORMATION SYSTEMS AND DATA MINING
ROUGH SETS THEORY AND DATA REDUCTION IN INFORMATION SYSTEMS AND DATA MINING Mofreh Hogo, Miroslav Šnorek CTU in Prague, Departement Of Computer Sciences And Engineering Karlovo Náměstí 13, 121 35 Prague
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationGene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm
Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationMATH 54 - TOPOLOGY SUMMER 2015 FINAL EXAMINATION. Problem 1
MATH 54 - TOPOLOGY SUMMER 2015 FINAL EXAMINATION ELEMENTS OF SOLUTION Problem 1 1. Let X be a Hausdorff space and K 1, K 2 disjoint compact subsets of X. Prove that there exist disjoint open sets U 1 and
More informationAn Empirical Study of Building Compact Ensembles
An Empirical Study of Building Compact Ensembles Huan Liu, Amit Mandvikar, and Jigar Mody Computer Science & Engineering Arizona State University Tempe, AZ 85281 {huan.liu,amitm,jigar.mody}@asu.edu Abstract.
More informationLECTURE 3 Functional spaces on manifolds
LECTURE 3 Functional spaces on manifolds The aim of this section is to introduce Sobolev spaces on manifolds (or on vector bundles over manifolds). These will be the Banach spaces of sections we were after
More informationChapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries
Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.
More informationModule 1. Probability
Module 1 Probability 1. Introduction In our daily life we come across many processes whose nature cannot be predicted in advance. Such processes are referred to as random processes. The only way to derive
More informationMeasurable Choice Functions
(January 19, 2013) Measurable Choice Functions Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/choice functions.pdf] This note
More informationThe Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee
The Learning Problem and Regularization 9.520 Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing
More informationOn minimal models of the Region Connection Calculus
Fundamenta Informaticae 69 (2006) 1 20 1 IOS Press On minimal models of the Region Connection Calculus Lirong Xia State Key Laboratory of Intelligent Technology and Systems Department of Computer Science
More informationLecture 9 Metric spaces. The contraction fixed point theorem. The implicit function theorem. The existence of solutions to differenti. equations.
Lecture 9 Metric spaces. The contraction fixed point theorem. The implicit function theorem. The existence of solutions to differential equations. 1 Metric spaces 2 Completeness and completion. 3 The contraction
More informationThe Decision List Machine
The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca
More informationOn Improving the k-means Algorithm to Classify Unclassified Patterns
On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,
More informationSupport Vector Machines with Example Dependent Costs
Support Vector Machines with Example Dependent Costs Ulf Brefeld, Peter Geibel, and Fritz Wysotzki TU Berlin, Fak. IV, ISTI, AI Group, Sekr. FR5-8 Franklinstr. 8/9, D-0587 Berlin, Germany Email {geibel
More informationAdditive Preference Model with Piecewise Linear Components Resulting from Dominance-based Rough Set Approximations
Additive Preference Model with Piecewise Linear Components Resulting from Dominance-based Rough Set Approximations Krzysztof Dembczyński, Wojciech Kotłowski, and Roman Słowiński,2 Institute of Computing
More informationLinear Classifiers as Pattern Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationChapter 4. Measure Theory. 1. Measure Spaces
Chapter 4. Measure Theory 1. Measure Spaces Let X be a nonempty set. A collection S of subsets of X is said to be an algebra on X if S has the following properties: 1. X S; 2. if A S, then A c S; 3. if
More informationOptimization Models for Detection of Patterns in Data
Optimization Models for Detection of Patterns in Data Igor Masich, Lev Kazakovtsev, and Alena Stupina Siberian State University of Science and Technology, Krasnoyarsk, Russia is.masich@gmail.com Abstract.
More informationExercises for Unit VI (Infinite constructions in set theory)
Exercises for Unit VI (Infinite constructions in set theory) VI.1 : Indexed families and set theoretic operations (Halmos, 4, 8 9; Lipschutz, 5.3 5.4) Lipschutz : 5.3 5.6, 5.29 5.32, 9.14 1. Generalize
More information(c) For each α R \ {0}, the mapping x αx is a homeomorphism of X.
A short account of topological vector spaces Normed spaces, and especially Banach spaces, are basic ambient spaces in Infinite- Dimensional Analysis. However, there are situations in which it is necessary
More informationDiscriminative Direction for Kernel Classifiers
Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationNotes for Functional Analysis
Notes for Functional Analysis Wang Zuoqin (typed by Xiyu Zhai) November 6, 2015 1 Lecture 18 1.1 The convex hull Let X be any vector space, and E X a subset. Definition 1.1. The convex hull of E is the
More informationHILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define
HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,
More informationMAT 771 FUNCTIONAL ANALYSIS HOMEWORK 3. (1) Let V be the vector space of all bounded or unbounded sequences of complex numbers.
MAT 771 FUNCTIONAL ANALYSIS HOMEWORK 3 (1) Let V be the vector space of all bounded or unbounded sequences of complex numbers. (a) Define d : V V + {0} by d(x, y) = 1 ξ j η j 2 j 1 + ξ j η j. Show that
More information