Multivariate interdependent discretization in discovering the best correlated attribute
Data Mining VI 35

Multivariate interdependent discretization in discovering the best correlated attribute

S. Chao & Y. P. Li
Faculty of Science and Technology, University of Macau, China

Abstract

The decision tree is one of the most widely used and practical methods in data mining. However, many discretization algorithms developed in this field are univariate only: they discretize continuous-valued attributes independently, without considering the interdependent relationships among the other attributes, at most taking the class attribute into account. Such univariate discretization is inadequate for handling critical problems, especially those found in the medical domain. In this paper, we propose a new multivariate discretization method called Multivariate Interdependent Discretization for Continuous Attributes (MIDCA). The method incorporates normalized relief and information measures to find the best correlated attribute with respect to each continuous-valued attribute being discretized, and uses the discovered attribute as the interdependent attribute for carrying out the multivariate discretization. We believe that a good multivariate discretization scheme for continuous-valued attributes should rely heavily on their respective best correlated attributes. Within an attribute space, each attribute should have at least one most relevant attribute, which may differ from attribute to attribute. Our novel multivariate discretization algorithm minimizes the uncertainty between the interdependent attribute and the continuous-valued attribute being discretized while at the same time maximizing their correlation. The method can be used as a pre-processing step for learning algorithms. The empirical results compare the performance of MIDCA against various discretization methods for two decision tree algorithms, ID3 and C4.5, on twelve real-life datasets from the UCI repository.
Keywords: multivariate discretization, interdependent feature, correlated attribute, data mining, machine learning.
1 Introduction

The decision tree is one of the most widely used and practical methods for inductive inference in the data mining and machine learning disciplines (Han and Kamber [1]). Most decision tree learning algorithms are limited to handling attributes with discrete values only. However, real datasets are usually a mix of discrete- and continuous-valued attributes. The common way to handle continuous-valued attributes is to discretize them by dividing their ranges into intervals. Moreover, even if a learning algorithm can deal with continuous-valued attributes directly, it is still better to carry out discretization prior to learning, so as to minimize the information loss and increase the classification accuracy. Many discretization algorithms developed in data mining are univariate: they discretize each continuous-valued attribute independently, without considering the interdependent relationships among the other attributes, at most taking the relationship with the class attribute into account. The simplest discretization method is equal width interval binning (Dougherty et al. [2]), which divides the range of a continuous-valued attribute into several equally sized bins. It makes no use of the class attribute and is thus an unsupervised discretization method. The best discretization algorithms are supervised, taking the class attribute information into consideration. One is entropy-based: it recursively partitions a continuous-valued attribute to obtain the minimal entropy measure (Fayyad and Irani [3]), using the minimum description length principle as the stopping criterion. Another is based on the chi-square statistic (Liu and Setiono [4]), which aims to preserve a distribution as similar as possible to the original data even after discretization. Evaluations and comparisons of some supervised and unsupervised univariate discretization methods can be found in [2, 5].
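As an illustration of the unsupervised baseline just described, equal width interval binning can be sketched in a few lines. This is our own minimal sketch, not the implementation evaluated in [2]; the function name and the default number of bins k are assumptions for illustration:

```python
def equal_width_bins(values, k=4):
    """Unsupervised equal width interval binning: split the observed
    range of a continuous attribute into k equally sized bins and
    return the bin index of each value plus the interior cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    cuts = [lo + width * i for i in range(1, k)]  # k-1 interior cut points
    # Clamp the maximum value into the last bin.
    bins = [min(int((v - lo) / width), k - 1) if width else 0 for v in values]
    return bins, cuts

bins, cuts = equal_width_bins([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
# bins -> [0, 0, 1, 1, 2, 2, 3, 3]; cuts -> [2.75, 4.5, 6.25]
```

Note that the class labels never appear in the computation, which is precisely why the method is unsupervised.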
As Bay [6, 7] indicated, the discretized intervals should make sense to a human expert. For example, when learning from medical data on hypertensive patients, we know that a person's blood pressure increases with age. It is therefore improper to set 140mmHg and 90mmHg as the systolic and diastolic pressure thresholds respectively for all patients, since the standard for diagnosing hypertension differs slightly between young people (orthoarteriotony is 120mmHg/80mmHg) and old people (orthoarteriotony is 140mmHg/90mmHg) [8]. If the blood pressure of a person aged 20 is 139mmHg/89mmHg, he/she might be considered potentially hypertensive. In contrast, a person aged 65 with the same blood pressure measures is definitely considered normotensive. Obviously, to discretize the continuous-valued attribute blood pressure, at least the attribute age must be taken into consideration, while discretizing some other continuous-valued attribute may not require age at all. The solution to this problem is to use multivariate interdependent discretization in place of univariate discretization. Multivariate interdependent discretization concerns the correlation between the attribute being discretized and the other potential interdependent attributes, in addition to the class attribute. Few works in the literature have discussed
multivariate interdependent discretization methods. In this paper, we propose a new multivariate interdependent discretization method, called Multivariate Interdependent Discretization for Continuous Attributes (MIDCA), that can be used as a preprocessing step for learning algorithms. The method is based on normalized relief and information measures to find the best correlated attribute for each continuous-valued attribute being discretized, using it as the interdependent attribute to carry out the multivariate discretization. In the next section, we describe our discretization method in detail. The evaluation of the proposed algorithm on some real datasets is presented in section 3. Finally, we discuss the limitations of the method and present directions for further research in section 4.

2 MIDCA algorithm

In order to obtain a good-quality multivariate discretization, discovering the best interdependent attribute with respect to each continuous-valued attribute being discretized is the primary task. To measure the correlation between attributes, the entropy measure [3, 9, 10] and relief theory [11, 12] are adopted. Relief is a feature-weighting algorithm for estimating the quality of attributes, and as such it is able to discover the interdependencies among attributes. Entropy, from information theory, is a measure of the uncertainty of an arbitrary variable. In this section, we first recall entropy information and the theory of relief, and then describe our discretization method in detail.

2.1 Entropy information

Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of a collection of instances [9, 10]. Given a collection of instances S containing C classes of a target attribute, the entropy of S relative to this C-class classification is defined as

Entropy(S) = -\sum_{i \in C} p(S_i) \log p(S_i).   (1)

where p(S_i) is the proportion of S belonging to class i.
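Eqn (1) can be computed directly from class counts. The following sketch (our own, for illustration) uses base-2 logarithms so the result is measured in bits:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, eqn (1), in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A 50/50 class split carries exactly one bit of uncertainty.
print(entropy(["yes", "yes", "no", "no"]))  # 1.0
```

A pure collection (all instances in one class) has entropy zero, the minimum; a uniform split over the classes attains the maximum.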
Based on this measure, we can find the most informative attribute A relative to a collection of examples S by defining a measure called information gain:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v).   (2)

where Values(A) is the set of all distinct values of attribute A, and S_v is the subset of S for which attribute A has value v, that is, S_v = \{ s \in S \mid A(s) = v \}.

2.2 Relief

The key idea of relief (Kira and Rendell [11, 12]) is to estimate the quality of an attribute by calculating how well its values distinguish among instances from the same class and from different classes. A good attribute should have the same value for instances from the same class and should differentiate between instances
from different classes. Kononenko [13] notes that relief attempts to approximate the following difference of probabilities for the weight of an attribute A:

Relief_A = P(\text{different value of } A \mid \text{different class}) - P(\text{different value of } A \mid \text{same class}).   (3)

which can be reformulated as

Relief_A = \frac{Gini'(A) \sum_{x \in X} p(x)^2}{\left(1 - \sum_{c \in C} p(c)^2\right) \sum_{c \in C} p(c)^2}.   (4)

where C is the class attribute and

Gini'(A) = \sum_{c \in C} p(c)(1 - p(c)) - \sum_{x \in X} \frac{p(x)^2}{\sum_{x \in X} p(x)^2} \sum_{c \in C} p(c \mid x)(1 - p(c \mid x)).   (5)

Gini' is a variant of another attribute quality measure, the Gini-index (Breiman [14]).

2.3 MIDCA

Our proposed multivariate discretization method MIDCA is mainly concerned with discovering the best interdependent attribute relative to the continuous-valued attribute being discretized. Within an attribute space, attributes have a certain relevancy to each other. No matter how loose or tight these relationships are, there must exist at least one interdependent attribute that correlates best with the continuous-valued attribute being discretized, and we believe that a good multivariate discretization scheme for continuous-valued attributes should rely heavily on these best correlated attributes. We assume a dataset S = {s_1, s_2, ..., s_N} containing N instances. Each instance s ∈ S is defined over a set of M attributes (features) A = {a_1, a_2, ..., a_M} and a class attribute c ∈ C. For each continuous-valued attribute a_i ∈ A there exists at least one a_j ∈ A such that a_j is the most correlated attribute with respect to a_i, or vice versa, since the interdependent weight is measured symmetrically. To find such a best interdependent attribute a_j for each continuous-valued attribute a_i, both entropy information and the relief measure are taken into account to capture the interactions within the attribute space A.
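Kononenko's probabilistic reformulation of relief in section 2.2 can be sketched for a discrete attribute as follows. This is our own illustrative reading of eqns (4) and (5), not the authors' code; the function name is an assumption:

```python
from collections import Counter

def relief_weight(attr_vals, class_vals):
    """Probabilistic relief weight of a discrete attribute, after
    Kononenko's reformulation in eqns (4) and (5)."""
    n = len(attr_vals)
    p_x = {x: c / n for x, c in Counter(attr_vals).items()}
    p_c = [k / n for k in Counter(class_vals).values()]
    sum_px2 = sum(p * p for p in p_x.values())
    sum_pc2 = sum(p * p for p in p_c)
    # Gini'(A): class impurity minus the attribute-conditional impurity,
    # the latter weighted by the normalized squared value probabilities.
    gini = sum(p * (1 - p) for p in p_c)
    for x, px in p_x.items():
        sub = [c for a, c in zip(attr_vals, class_vals) if a == x]
        p_cx = [k / len(sub) for k in Counter(sub).values()]
        gini -= (px * px / sum_px2) * sum(p * (1 - p) for p in p_cx)
    return gini * sum_px2 / ((1 - sum_pc2) * sum_pc2)

# An attribute that determines the class perfectly gets weight 1.
w = relief_weight(["x", "x", "y", "y"], ["a", "a", "b", "b"])  # 1.0
```

An attribute that is independent of the class drives the conditional impurity to the class impurity itself and the weight to zero.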
First, for each attribute pair (a_i, a_j) ∈ A where i ≠ j, we calculate the correlation weights by utilizing both symmetric relief and symmetric entropy information, then normalize the two measures and finally take the best result as our target. The weight can be defined as

InterdependentWeight(a_i, a_j) = \left[ \frac{SymGain(a_i, a_j)}{\sum_{a_M \in A} SymGain(a_i, a_M)} + \frac{SymRelief(a_i, a_j)}{\sum_{a_M \in A} SymRelief(a_i, a_M)} \right] / 2.   (6)
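The gain half of eqn (6) can be sketched as below; the symmetric gain simply averages eqn (2) with the roles of the two attributes exchanged, and the relief half is computed analogously before the two normalized proportions are averaged. This is a simplified sketch under our own naming, not the authors' implementation:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(a_vals, b_vals):
    """Information gain of attribute A about attribute B,
    i.e. eqn (2) with B playing the role of the class."""
    n = len(b_vals)
    g = entropy(b_vals)
    for v in set(a_vals):
        sub = [b for a, b in zip(a_vals, b_vals) if a == v]
        g -= len(sub) / n * entropy(sub)
    return g

def sym_gain(a_vals, b_vals):
    """SymGain(A, B) = [Gain(A, B) + Gain(B, A)] / 2."""
    return (gain(a_vals, b_vals) + gain(b_vals, a_vals)) / 2

def gain_proportions(candidates, target):
    """The gain half of eqn (6): each candidate's SymGain with the
    target, normalized by the sum over all candidates."""
    raw = {name: sym_gain(vals, target) for name, vals in candidates.items()}
    total = sum(raw.values()) or 1.0
    return {name: w / total for name, w in raw.items()}

weights = gain_proportions(
    {"a1": ["x", "x", "y", "y"],   # perfectly correlated with target
     "a2": ["x", "y", "x", "y"]},  # independent of target
    ["a", "a", "b", "b"])
# weights -> {"a1": 1.0, "a2": 0.0}
```

Normalizing to proportions is what makes the gain term comparable with the relief term, which lives on a different scale.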
SymGain(a_i, a_j) and SymRelief(a_i, a_j) are symmetric forms of the information gain and relief measures respectively, obtained by treating either a_i or a_j in turn as the class attribute C in the formula. That is,

SymGain(A, B) = [Gain(A, B) + Gain(B, A)] / 2.   (7)

and

SymRelief(A, B) = \left[ \frac{Gini'(A) \sum_{x \in X} p(x)^2}{\left(1 - \sum_{b \in B} p(b)^2\right) \sum_{b \in B} p(b)^2} + \frac{Gini'(B) \sum_{y \in Y} p(y)^2}{\left(1 - \sum_{a \in A} p(a)^2\right) \sum_{a \in A} p(a)^2} \right] / 2.   (8)

The advantage of incorporating both the information gain and relief measures into our multivariate interdependent discretization algorithm is that it minimizes the uncertainty between the continuous-valued attribute being discretized and its interdependent attribute while at the same time maximizing their correlation. The measures output from eqns (7) and (8) are on different scales; the way to balance them is to normalize each result by using proportions in place of raw values. Thus, the best interdependent attribute with respect to the continuous-valued attribute being discretized is determined by averaging the two normalized proportions and selecting the attribute whose interdependent weight is the best among all the potential interdependent attributes. However, if a potential interdependent attribute is itself continuous-valued, it is first discretized with the entropy-based method [2, 3]. This is important and may reduce the bias in favor of attributes with more values. Furthermore, our method selects an interdependent attribute for each continuous-valued attribute in a dataset rather than using a single one for all continuous-valued attributes; this is also the main factor in improving the final classification accuracy. Once the best interdependent attribute has been discovered, the multivariate interdependent discretization proceeds by adopting the efficient supervised discretization algorithm of minimal entropy partitioning with MDLP (Fayyad and Irani [3]). Nevertheless, our method differs from the original in several respects.
First, our method ensures at least a binary discretization for each continuous-valued attribute, unlike the original method, in which the interval of a continuous-valued attribute sometimes remains [-∞, +∞]. If a continuous-valued attribute generates no cut point, the attribute is effectively useless and will be ignored during the learning process. This conflicts with our belief that most continuous-valued attributes in the medical domain have specific meanings; most such figures express degrees of illness, such as blood pressure, heart rate, cholesterol, etc., so their discretization cannot be ignored. Second, our discretization is carried out with respect to the best interdependent attribute discovered from eqn (6), in addition to the class attribute. Moreover, assume that the interdependent attribute INT has T discrete values; each of its distinct values identifies a subset of the original dataset, and the probabilities should be generated relative to that subset instead of the whole dataset. Therefore, the combinational probability distribution over the attribute space {C} ∪ A is redefined, as is the information gain measure:
MIDCAInfoGain(A, P; INT, S) = Entropy(S \mid INT_T) - \sum_{v \in Values(A)|_{INT_T}} \frac{|S_v|}{|S_{INT_T}|} Entropy(S_v).   (9)

where the measure defines the class information entropy of the partition induced by P, a collection of candidate cut points for attribute A, under the projection of value T of the interdependent attribute INT. We replace Entropy(S) with the conditional entropy Entropy(S | INT_T) to emphasize the interaction with the interdependent attribute INT. As a consequence, v ∈ Values(A)|_{INT_T} ranges over the distinct values of attribute A within the cluster induced by value T of the interdependent attribute INT, and S_v is the subset of S for which attribute A has value v under the projection of T for INT, that is, S_v = \{ s \in S \mid A(s) = v \wedge INT(s) = T \}.

2.4 MIDCA high-level description

We now present a high-level description of our MIDCA algorithm, together with the algorithm INTDDiscovery for discovering the best correlated interdependent attribute:

Algorithm MIDCA
  For each continuous-valued attribute A:
    Sort A in ascending order;
    Discover the best interdependent attribute of A by INTDDiscovery;
    Repeat
      Discover the best cut points by the MIDCAInfoGain measure;
    Until MDLP = pass;
    Regenerate the dataset according to the obtained cut points;
End MIDCA.

Algorithm INTDDiscovery
  For each attribute atr other than A:
    If atr is a continuous-valued attribute:
      Discretize atr using the entropy-based method;
    Calculate the symmetric entropy SymGain for A and atr;
    Calculate the symmetric relief SymRelief for A and atr;
    Normalize SymGain and SymRelief;
    Average SymGain and SymRelief;
  Output the attribute with the highest average measure;
End INTDDiscovery.

3 Experiments

In this section, our empirical evaluation results are presented. We tested MIDCA on twelve real-life datasets from the UCI repository (Blake and Merz [15]), each containing a mixture of continuous and discrete attributes. The details of each dataset are listed in table 1.
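The cut-point search of the MIDCA pseudocode above, restricted to each cluster induced by a value of the interdependent attribute as in eqn (9), might be sketched as follows. This is our own single-cut simplification of the recursive entropy/MDLP procedure, with hypothetical names, shown only to illustrate the per-cluster search and the at-least-binary guarantee:

```python
from math import log2
from collections import Counter

def _ent(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Single entropy-minimizing boundary; one cut is returned even if
    MDLP would reject every candidate (the at-least-binary guarantee)."""
    pairs = sorted(zip(values, labels))
    n, best = len(pairs), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # candidate cuts lie between distinct values only
        e = (i * _ent([l for _, l in pairs[:i]])
             + (n - i) * _ent([l for _, l in pairs[i:]])) / n
        if best is None or e < best[0]:
            best = (e, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1] if best else None

def midca_cuts(values, labels, int_vals):
    """Search for cut points separately inside each cluster induced by
    a value T of the interdependent attribute INT, as in eqn (9)."""
    return {t: best_cut([v for v, w in zip(values, int_vals) if w == t],
                        [l for l, w in zip(labels, int_vals) if w == t])
            for t in set(int_vals)}

# Echoing the blood-pressure example: a lower systolic threshold is
# found inside the "young" cluster than inside the "old" cluster.
cuts = midca_cuts([130, 139, 120, 150, 135, 145],
                  ["high", "high", "normal", "high", "normal", "high"],
                  ["young"] * 4 + ["old"] * 2)
# cuts -> {"young": 125.0, "old": 140.0}
```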
In order to compare the MIDCA algorithm with different discretization methods, we simulated the univariate and multivariate discretization methods, where the interdependent attributes for the multivariate discretizations are obtained using Relief and Gain Ratio respectively. In the experiment, MIDCA and the various discretization methods are used as pre-processing steps for the two learning algorithms: ID3 (Quinlan [16]) and C4.5 (Quinlan [17, 18]).

Table 1: Twelve real-life datasets from UCI (Cleve, Hepatitis, Hypothyroid, Heart, Sick-euthyroid, Iris, Australian, Auto, Breast, Crx, Diabetes, Horse-colic), listing for each the numbers of continuous and discrete features, the training and testing set sizes, and the number of classes.

Table 2: Comparison of classification error rates (%) of the decision tree algorithm ID3, per dataset and on average, with no discretization, univariate discretization, multivariate discretization (average of Relief and GainRatio), and MIDCA.
The experimental results summarized in tables 2 and 3 reveal that MIDCA improves the classification accuracy on average. In table 2, although MIDCA increases the error rate on three datasets each for ID3 with univariate and with multivariate discretization, it decreases the error rate on all but one dataset for ID3 without discretization. In table 3, it improves the performance on all but one dataset for C4.5 with and without univariate discretization, and on all but two datasets for C4.5 with multivariate discretization. For the remaining datasets, MIDCA provides a significant increase in classification accuracy, especially on two datasets, Hypothyroid and Sick-euthyroid, which approach zero error rates for both learning algorithms. As observed from table 2, MIDCA slightly decreases the performance on three datasets compared with ID3 with univariate discretization; similarly, MIDCA increases the error rate on one dataset for C4.5 with univariate discretization in table 3. We found that all of these downgraded datasets contain only continuous attributes. Classification performance worsens in this case because the MIDCA algorithm must first carry out a univariate discretization, prior to the multivariate discretization, whenever an interdependent attribute is itself continuous-valued. This extra step adds uncertainty to the attribute being discretized and hence increases the error rate accordingly.

Table 3: Comparison of classification error rates (%) of the decision tree algorithm C4.5, per dataset and on average, with no discretization, univariate discretization, multivariate discretization (average of Relief and GainRatio), and MIDCA.

Moreover, as the average error rates in tables 2 and 3 show, our method MIDCA decreases the classification error rate from 17.75%, 16.74% and 14.77% down to 13.27% for the ID3 algorithm, and from 15.66%, 15.02% and 14.21% down to 12.33% for the C4.5 algorithm, although
several datasets obtained higher error rates than the averages of the algorithms with the multivariate discretizations based on relief and gain ratio respectively. The improvements relative to both algorithms without discretization, with univariate discretization, and with multivariate discretization reach approximately 25.2% and 21.3%, 20.7% and 17.9%, and 10.2% and 13.2% respectively. The least improvement is over 10%, which verifies that our MIDCA algorithm, incorporating Relief and Gain Ratio, outperforms the originals, and of course the univariate discretization method and no discretization at all.

4 Conclusions and future research

In this paper, we have proposed a novel method for multivariate interdependent discretization, focused on discovering the best interdependent attribute for each continuous-valued attribute. The method can be used as a preprocessing tool for any learning algorithm, and it ensures at least a binary discretization, which minimizes the information loss and maximizes the classification accuracy. The empirical evaluation results presented in this paper give significant evidence that our method MIDCA can appropriately discretize a continuous-valued attribute with respect to a specific interdependent attribute and thus improve the final classification performance. However, the method has limitations in handling datasets containing only continuous-valued attributes: in this case, the complexity and cost of discovering an interdependent attribute increase and the performance of MIDCA decreases, since a perfect matching of an interdependent attribute to a continuous-valued attribute is the key success factor in multivariate interdependent discretization. Our experiments applied the ID3 and C4.5 learning algorithms; for further comparison, we plan to perform experiments with other learning algorithms, such as naive Bayes (Langley et al.
[19]) or clustering methods, etc. On the other hand, further research should investigate the complexity and efficiency of the algorithm, and may extend the discretization to more than two attributes. Finally, the limitations above should be resolved so as to handle continuous-valued interdependent attributes efficiently and effectively. These research directions may finally guide us to a valuable algorithm.

References

[1] Han J. & Kamber M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[2] Dougherty J., Kohavi R. & Sahami M., Supervised and unsupervised discretization of continuous features. Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, 1995.
[3] Fayyad U. M. & Irani K. B., Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 1993.
[4] Liu H. & Setiono R., Feature selection via discretization. Technical report, Dept. of Information Systems and Computer Science, Singapore, 1997.
[5] Liu H., Hussain F., Tan C. & Dash M., Discretization: an enabling technique. Data Mining and Knowledge Discovery.
[6] Bay S. D., Multivariate discretization of continuous variables for set mining. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[7] Bay S. D., Multivariate discretization for set mining. Knowledge and Information Systems, 3(4), 2001.
[8] Hypertension Research Group, Department of Cardiology, People's Hospital, Beijing Medical University (eds.), A Hundred Questions and Answers on Modern Knowledge of Hypertension, 1998.
[9] Mitchell T. M., Machine Learning, McGraw-Hill, 1997.
[10] Zhu Xuelong, Fundamentals of Applied Information Theory, Tsinghua University Press, 2000.
[11] Kira K. & Rendell L., A practical approach to feature selection. Proceedings of the International Conference on Machine Learning, Aberdeen, Morgan Kaufmann, 1992.
[12] Kira K. & Rendell L., The feature selection problem: traditional methods and a new algorithm. Proceedings of AAAI-92, San Jose, CA, 1992.
[13] Kononenko I., On biases in estimating multi-valued attributes. Proceedings of IJCAI-95, 1995.
[14] Breiman L., Technical note: some properties of splitting criteria. Machine Learning, 24, 1996.
[15] Blake C. L. & Merz C. J., UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
[16] Quinlan J. R., Induction of decision trees. Machine Learning, 1(1), 1986.
[17] Quinlan J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[18] Quinlan J. R., Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 1996.
[19] Langley P., Iba W. & Thompson K., An analysis of Bayesian classifiers. Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, 1992.
An Empirical Study of Building Compact Ensembles Huan Liu, Amit Mandvikar, and Jigar Mody Computer Science & Engineering Arizona State University Tempe, AZ 85281 {huan.liu,amitm,jigar.mody}@asu.edu Abstract.
More informationOn Multi-Class Cost-Sensitive Learning
On Multi-Class Cost-Sensitive Learning Zhi-Hua Zhou and Xu-Ying Liu National Laboratory for Novel Software Technology Nanjing University, Nanjing 210093, China {zhouzh, liuxy}@lamda.nju.edu.cn Abstract
More informationConcept Lattice based Composite Classifiers for High Predictability
Concept Lattice based Composite Classifiers for High Predictability Zhipeng XIE 1 Wynne HSU 1 Zongtian LIU 2 Mong Li LEE 1 1 School of Computing National University of Singapore Lower Kent Ridge Road,
More informationSelected Algorithms of Machine Learning from Examples
Fundamenta Informaticae 18 (1993), 193 207 Selected Algorithms of Machine Learning from Examples Jerzy W. GRZYMALA-BUSSE Department of Computer Science, University of Kansas Lawrence, KS 66045, U. S. A.
More informationLattice Machine: Version Space in Hyperrelations
Lattice Machine: Version Space in Hyperrelations [Extended Abstract] Hui Wang, Ivo Düntsch School of Information and Software Engineering University of Ulster Newtownabbey, BT 37 0QB, N.Ireland {H.Wang
More informationDevelopment of a Data Mining Methodology using Robust Design
Development of a Data Mining Methodology using Robust Design Sangmun Shin, Myeonggil Choi, Youngsun Choi, Guo Yi Department of System Management Engineering, Inje University Gimhae, Kyung-Nam 61-749 South
More informationGrowing a Large Tree
STAT 5703 Fall, 2004 Data Mining Methodology I Decision Tree I Growing a Large Tree Contents 1 A Single Split 2 1.1 Node Impurity.................................. 2 1.2 Computation of i(t)................................
More informationShort Note: Naive Bayes Classifiers and Permanence of Ratios
Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence
More informationClassification Based on Logical Concept Analysis
Classification Based on Logical Concept Analysis Yan Zhao and Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 E-mail: {yanzhao, yyao}@cs.uregina.ca Abstract.
More informationQuestion of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning
Question of the Day Machine Learning 2D1431 How can you make the following equation true by drawing only one straight line? 5 + 5 + 5 = 550 Lecture 4: Decision Tree Learning Outline Decision Tree for PlayTennis
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationENTROPIES OF FUZZY INDISCERNIBILITY RELATION AND ITS OPERATIONS
International Journal of Uncertainty Fuzziness and Knowledge-Based Systems World Scientific ublishing Company ENTOIES OF FUZZY INDISCENIBILITY ELATION AND ITS OEATIONS QINGUA U and DAEN YU arbin Institute
More informationClassification and Prediction
Classification Classification and Prediction Classification: predict categorical class labels Build a model for a set of classes/concepts Classify loan applications (approve/decline) Prediction: model
More informationText Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data
More informationDecision Tree. Decision Tree Learning. c4.5. Example
Decision ree Decision ree Learning s of systems that learn decision trees: c4., CLS, IDR, ASSISA, ID, CAR, ID. Suitable problems: Instances are described by attribute-value couples he target function has
More informationIntroduction to Machine Learning. Lecture 2
Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for
More information.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label.
.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Definitions Data. Consider a set A = {A 1,...,A n } of attributes, and an additional
More informationCorrelation Preserving Unsupervised Discretization. Outline
Correlation Preserving Unsupervised Discretization Jee Vang Outline Paper References What is discretization? Motivation Principal Component Analysis (PCA) Association Mining Correlation Preserving Discretization
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationON COMBINING PRINCIPAL COMPONENTS WITH FISHER S LINEAR DISCRIMINANTS FOR SUPERVISED LEARNING
ON COMBINING PRINCIPAL COMPONENTS WITH FISHER S LINEAR DISCRIMINANTS FOR SUPERVISED LEARNING Mykola PECHENIZKIY*, Alexey TSYMBAL**, Seppo PUURONEN* Abstract. The curse of dimensionality is pertinent to
More informationUsing HDDT to avoid instances propagation in unbalanced and evolving data streams
Using HDDT to avoid instances propagation in unbalanced and evolving data streams IJCNN 2014 Andrea Dal Pozzolo, Reid Johnson, Olivier Caelen, Serge Waterschoot, Nitesh V Chawla and Gianluca Bontempi 07/07/2014
More informationBackground literature. Data Mining. Data mining: what is it?
Background literature Data Mining Lecturer: Peter Lucas Assessment: Written exam at the end of part II Practical assessment Compulsory study material: Transparencies Handouts (mostly on the Web) Course
More informationEECS 349:Machine Learning Bryan Pardo
EECS 349:Machine Learning Bryan Pardo Topic 2: Decision Trees (Includes content provided by: Russel & Norvig, D. Downie, P. Domingos) 1 General Learning Task There is a set of possible examples Each example
More informationDecision Tree Learning Lecture 2
Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over
More informationFrom statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu
From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom
More informationFundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur
Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new
More informationInformation Theory & Decision Trees
Information Theory & Decision Trees Jihoon ang Sogang University Email: yangjh@sogang.ac.kr Decision tree classifiers Decision tree representation for modeling dependencies among input variables using
More informationData Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Part I Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,
More informationhsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference
CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science
More informationIntroduction. Abstract
From: KDD-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved. BAYDA: Software for Bayesian Classification and Feature Selection Petri Kontkanen, Petri Myllymäki, Tomi Silander, Henry
More informationDecision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1
Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating
More informationSupport Vector Machine via Nonlinear Rescaling Method
Manuscript Click here to download Manuscript: svm-nrm_3.tex Support Vector Machine via Nonlinear Rescaling Method Roman Polyak Department of SEOR and Department of Mathematical Sciences George Mason University
More informationLearning Objectives. c D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 1
Learning Objectives At the end of the class you should be able to: identify a supervised learning problem characterize how the prediction is a function of the error measure avoid mixing the training and
More informationChapter 6: Classification
Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant
More informationLearning Decision Trees
Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationBayesian Classification. Bayesian Classification: Why?
Bayesian Classification http://css.engineering.uiowa.edu/~comp/ Bayesian Classification: Why? Probabilistic learning: Computation of explicit probabilities for hypothesis, among the most practical approaches
More informationResampling Methods CAPT David Ruth, USN
Resampling Methods CAPT David Ruth, USN Mathematics Department, United States Naval Academy Science of Test Workshop 05 April 2017 Outline Overview of resampling methods Bootstrapping Cross-validation
More informationAn asymmetric entropy measure for decision trees
An asymmetric entropy measure for decision trees Simon Marcellin Laboratoire ERIC Université Lumière Lyon 2 5 av. Pierre Mendès-France 69676 BRON Cedex France simon.marcellin@univ-lyon2.fr Djamel A. Zighed
More informationA Unified Bias-Variance Decomposition
A Unified Bias-Variance Decomposition Pedro Domingos Department of Computer Science and Engineering University of Washington Box 352350 Seattle, WA 98185-2350, U.S.A. pedrod@cs.washington.edu Tel.: 206-543-4229
More informationClassification and Regression Trees
Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity
More informationML techniques. symbolic techniques different types of representation value attribute representation representation of the first order
MACHINE LEARNING Definition 1: Learning is constructing or modifying representations of what is being experienced [Michalski 1986], p. 10 Definition 2: Learning denotes changes in the system That are adaptive
More informationReview of Lecture 1. Across records. Within records. Classification, Clustering, Outlier detection. Associations
Review of Lecture 1 This course is about finding novel actionable patterns in data. We can divide data mining algorithms (and the patterns they find) into five groups Across records Classification, Clustering,
More informationStudy on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr
Study on Classification Methods Based on Three Different Learning Criteria Jae Kyu Suhr Contents Introduction Three learning criteria LSE, TER, AUC Methods based on three learning criteria LSE:, ELM TER:
More informationCS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón
CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationInducing Polynomial Equations for Regression
Inducing Polynomial Equations for Regression Ljupčo Todorovski, Peter Ljubič, and Sašo Džeroski Department of Knowledge Technologies, Jožef Stefan Institute Jamova 39, SI-1000 Ljubljana, Slovenia Ljupco.Todorovski@ijs.si
More informationBayesian Averaging of Classifiers and the Overfitting Problem
Bayesian Averaging of Classifiers and the Overfitting Problem Pedro Domingos pedrod@cs.washington.edu Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, U.S.A.
More informationData Mining. 3.6 Regression Analysis. Fall Instructor: Dr. Masoud Yaghini. Numeric Prediction
Data Mining 3.6 Regression Analysis Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Straight-Line Linear Regression Multiple Linear Regression Other Regression Models References Introduction
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Lecture 06 - Regression & Decision Trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom
More informationthe tree till a class assignment is reached
Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal
More informationNot so naive Bayesian classification
Not so naive Bayesian classification Geoff Webb Monash University, Melbourne, Australia http://www.csse.monash.edu.au/ webb Not so naive Bayesian classification p. 1/2 Overview Probability estimation provides
More informationData Mining and Knowledge Discovery: Practice Notes
Data Mining and Knowledge Discovery: Practice Notes dr. Petra Kralj Novak Petra.Kralj.Novak@ijs.si 7.11.2017 1 Course Prof. Bojan Cestnik Data preparation Prof. Nada Lavrač: Data mining overview Advanced
More informationLearning Decision Trees
Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationLearning From Inconsistent and Noisy Data: The AQ18 Approach *
Eleventh International Symposium on Methodologies for Intelligent Systems, Warsaw, pp. 411-419, 1999 Learning From Inconsistent and Noisy Data: The AQ18 Approach * Kenneth A. Kaufman and Ryszard S. Michalski*
More informationCSCI 5622 Machine Learning
CSCI 5622 Machine Learning DATE READ DUE Mon, Aug 31 1, 2 & 3 Wed, Sept 2 3 & 5 Wed, Sept 9 TBA Prelim Proposal www.rodneynielsen.com/teaching/csci5622f09/ Instructor: Rodney Nielsen Assistant Professor
More informationMachine Learning 3. week
Machine Learning 3. week Entropy Decision Trees ID3 C4.5 Classification and Regression Trees (CART) 1 What is Decision Tree As a short description, decision tree is a data classification procedure which
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative
More informationAn analysis of data characteristics that affect naive Bayes performance
An analysis of data characteristics that affect naive Bayes performance Irina Rish Joseph Hellerstein Jayram Thathachar IBM T.J. Watson Research Center 3 Saw Mill River Road, Hawthorne, NY 1532 RISH@US.IBM.CM
More informationMachine Learning and Data Mining. Decision Trees. Prof. Alexander Ihler
+ Machine Learning and Data Mining Decision Trees Prof. Alexander Ihler Decision trees Func-onal form f(x;µ): nested if-then-else statements Discrete features: fully expressive (any func-on) Structure:
More informationFeature gene selection method based on logistic and correlation information entropy
Bio-Medical Materials and Engineering 26 (2015) S1953 S1959 DOI 10.3233/BME-151498 IOS Press S1953 Feature gene selection method based on logistic and correlation information entropy Jiucheng Xu a,b,,
More information