Correlation Preserving Unsupervised Discretization
Jee Vang

Outline
- Paper References
- What is discretization?
- Motivation
- Principal Component Analysis (PCA)
- Association Mining
- Correlation Preserving Discretization
- Results
- Summary/Conclusion
Paper References
- S. Mehta, S. Parthasarathy, and H. Yang, "Toward Unsupervised Correlation Preserving Discretization," IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 8, 2005.
- S. Mehta, S. Parthasarathy, and H. Yang, "Correlation Preserving Discretization," Proceedings of the Fourth IEEE International Conference on Data Mining, 2004.

What is discretization?
Data types:
- Continuous (aka quantitative), e.g. weight (lbs): 155.2, 160.2, 199.3
- Categorical (aka qualitative), e.g. color: red, blue, green, white
Discretization is the transformation of a continuous variable into a categorical one:
- e.g. weight (lbs): 155.2, 160.2, 199.3 -> low, medium, high
- Consequence: information (e.g. correlation) is lost
Dimensions of discretization:
- Unsupervised vs. supervised
  - Unsupervised: no class label; the two main algorithms are equal-width and equal-frequency (sketched below)
  - Supervised: uses a class label; many algorithms exist
- Dynamic vs. static
  - Dynamic: discretize while learning takes place
  - Static: discretize before learning takes place
- Local vs. global
  - Local: a subset of variables at a time
  - Global: all variables at a time
- Top-down vs. bottom-up
  - Top-down: start with an empty set of cutpoints and add cutpoints
  - Bottom-up: start with all data points as cutpoints and merge them
Why bother with discretization? Many learning algorithms accept data only in categorical form.
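For reference, a minimal sketch of the two standard unsupervised discretizers named above, equal-width and equal-frequency; the helper names and the sample weights are illustrative, not from the paper.

```python
import numpy as np

def equal_width(x, k):
    # Split the range of x into k intervals of equal width.
    edges = np.linspace(x.min(), x.max(), k + 1)
    # Map each value to its interval index (0..k-1) via the interior cutpoints.
    return np.digitize(x, edges[1:-1], right=True)

def equal_frequency(x, k):
    # Place cutpoints at quantiles so each interval holds ~len(x)/k points.
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.digitize(x, edges[1:-1], right=True)

weights = np.array([155.2, 160.2, 199.3, 170.0, 158.7, 182.4])
print(equal_width(weights, 3))      # [0 0 2 1 0 1] -> low/medium/high labels
print(equal_frequency(weights, 3))  # [0 1 2 1 0 2]
```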
Motivation
What if your dataset has mixed data types and no class label, and you want the discretization to preserve, in the categorical domain, the correlation present in the continuous domain? Are there algorithms that aim to solve this problem?
- S.D. Bay, "Multivariate Discretization for Set Mining," Knowledge and Information Systems, Vol. 3, No. 4, pp. 491-512, 2001.
  - No dimension reduction; computationally expensive
- M.C. Ludl and G. Widmer, "Relative Unsupervised Discretization for Association Rule Mining," Proc. Second Int'l Conf. Principles and Practice of Knowledge Discovery in Databases, pp. 148-158, 2000.
  - No consideration of the interdependence between variables
- Mehta et al.
  - Use PCA to deal with high dimensionality: PCA generates a set of n orthogonal vectors from an input data set of dimensionality N, where n < N and the n orthogonal directions preserve most of the variance in the input data
  - Use association mining to handle mixed data types

Principal Component Analysis (PCA)
PCA generates a set of n orthogonal vectors from an input data set of dimensionality N, where n < N and the n orthogonal directions preserve most of the variance in the input data.
Steps (sketched below):
1. Generate the N x N correlation matrix
2. Compute the eigenvectors and eigenvalues of the correlation matrix
   - Eigenvectors represent orthogonal axes
   - Each eigenvalue represents the variance along its corresponding eigenvector
3. Retain the k (k < N) eigenvectors corresponding to the largest eigenvalues, which together account for 90% of the variance
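A minimal numpy sketch of these three steps; the function name pca_axes and the demo data are assumptions for illustration, while the 90% variance threshold follows the slide.

```python
import numpy as np

def pca_axes(X, var_target=0.90):
    # 1. N x N correlation matrix (columns of X are the variables).
    R = np.corrcoef(X, rowvar=False)
    # 2. Eigenvectors (orthogonal axes) and eigenvalues (variance per axis);
    #    eigh applies because R is symmetric.
    vals, vecs = np.linalg.eigh(R)
    order = np.argsort(vals)[::-1]      # sort by decreasing variance
    vals, vecs = vals[order], vecs[:, order]
    # 3. Keep the smallest k whose eigenvalues cover >= var_target of the variance.
    k = int(np.searchsorted(np.cumsum(vals) / vals.sum(), var_target)) + 1
    return vecs[:, :k], vals[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = 0.9 * X[:, 0] + 0.1 * X[:, 3]  # inject correlation between two columns
E, lam = pca_axes(X)
print(E.shape, lam)  # k retained eigenvectors and their variances
```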
Association Mining
Association mining generates association rules:
- Let I = {i1, i2, i3, ..., im} be a set of m distinct attributes (items)
- Let each transaction T in a database D contain a set of items such that T ⊆ I
- An association rule is an expression of the form A => B, where A and B are proper subsets of I and A ∩ B = ∅
- An itemset has support S if S percent of the transactions T in D contain the itemset

Similarity metric for the association patterns generated by two data sets (or two samples of the same data set): let A and B be the frequent itemsets of database samples d1 and d2, respectively, and for an itemset x let supp_d1(x) and supp_d2(x) be its frequency in d1 and d2. Then (sketched below):

sim(d1, d2) = [ Σ_{x ∈ A ∩ B} max(0, 1 - α·|supp_d1(x) - supp_d2(x)|) ] / |A ∪ B|

Correlation Preserving Discretization
1. Normalization and mean centralization
2. Eigenvector computation
3. Data projection onto the eigenspace
4. Discretization in the eigenspace
   - Only continuous variables: apply clustering to generate n cutpoints
   - Mixed data:
     1. Compute the frequent itemsets A over all categorical variables
     2. Split each eigendimension into equal-frequency intervals and compute the frequent itemsets in each interval, restricted to subsets of A
     3. Compute the similarity between contiguous intervals and merge them (a threshold must be defined)
5. Correlate the original dimensions with the eigenvectors
6. Reproject the eigenspace cutpoints onto the original dimensions (sketched below)
   - k-NN: find the k nearest neighbors of the cutpoint in the eigenspace, then take the mean/median of those points and the cutpoint in the original representation space
   - Direct projection: cos(θ_ij) = e_i · o_j, where e_i is the i-th eigenvector and o_j is an N-dimensional unit vector along the j-th dimension; multiply cos(θ_ij) by each cutpoint
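A minimal sketch of the similarity metric above, representing each sample's frequent itemsets as a dict from frozenset to support; the dict encoding, the parameter alpha, and the example itemsets are assumptions for illustration.

```python
def similarity(supp_d1, supp_d2, alpha=1.0):
    # A ∩ B: itemsets frequent in both samples; A ∪ B: frequent in either.
    common = supp_d1.keys() & supp_d2.keys()
    union = supp_d1.keys() | supp_d2.keys()
    # Each shared itemset contributes 1 minus a penalty for its support gap.
    total = sum(max(0.0, 1.0 - alpha * abs(supp_d1[x] - supp_d2[x]))
                for x in common)
    return total / len(union) if union else 0.0

# Frequent itemsets of two contiguous eigenspace intervals
d1 = {frozenset({"male"}): 0.7, frozenset({"male", "married"}): 0.4}
d2 = {frozenset({"male"}): 0.6, frozenset({"single"}): 0.5}
print(similarity(d1, d2))  # 0.3 here; high values -> merge the intervals
```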
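And a sketch of the direct-projection option in step 6: since o_j is a unit vector along original dimension j, cos(θ_ij) = e_i · o_j reduces to the j-th component of eigenvector e_i. The function name and the dict output format are illustrative, and E is assumed to hold unit-length eigenvectors as columns (as in the PCA sketch above).

```python
import numpy as np

def reproject_cutpoints(cutpoints, E):
    # cutpoints[i]: list of cutpoints found on eigendimension i (step 4);
    # E: eigenvectors as columns, shape (N, k).
    projected = {}
    for i, cps in enumerate(cutpoints):
        for j in range(E.shape[0]):
            cos_ij = E[j, i]  # e_i . o_j, the cosine between the two axes
            # Scale each eigenspace cutpoint by the cosine to land on dimension j.
            projected.setdefault(j, []).extend(cos_ij * c for c in cps)
    return projected  # original dimension j -> candidate cutpoints

# e.g. with two retained eigendimensions and their eigenspace cutpoints:
# projected = reproject_cutpoints([[-0.8, 0.5], [0.1]], E)
```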
Results - Dataset
(dataset summary table from this slide not preserved in the transcript)
Results - Compared to MVD and ME-MDL
Correlation preserving discretization is more meaningful:
- The population within an interval should exhibit similar properties
- The populations in different intervals should exhibit different properties
Intuitive cutpoints on the Adult dataset:
- Age: cutpoints similar to MVD; meaningful cutpoints (marriage, retirement, education)
- Capital Gain: low, medium, high
- Capital Loss: binary
- Hours/week: correlated with age

Results - Classification
(classification error-rate figures not preserved in the transcript)
Results - Classification
Compared to other classifiers: C4.5, IBk, PART, Bayes, OneR, kernel-based, SMO
- Shows the lowest error rate in 8 of 13 datasets
- With up to 30% of values missing at random, the datasets did not produce significant differences in error rate
Summary/Conclusion
- This method uses and preserves the correlation between all variables of mixed-type data to discretize
- The resulting discretization is meaningful and intuitive
- Promising results on classification and missing-data problems