Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data
Vadim Ayuyev, Joseph Jupin, Philip Harris and Zoran Obradovic
Temple University, Philadelphia, USA, 2009
Real Life Data is Often Missing Some Values
Outline
Notation

Missing values are assumed to be missing completely at random.

- $x_i$ — the $i$-th instance
- $x_i^{ms}$ — the $i$-th instance, which has a missing value in the attribute being processed
- $x_{i,n}$ — the $n$-th attribute of instance $x_i$
- $x_{i,n}^{ms}$ — the $n$-th (missing) attribute of instance $x_i$
- $P_n$ — number of non-missing values in the $n$-th attribute
- $C_{i,n}$ — cluster constructed for $x_{i,n}^{ms}$
- $x_r(C_{i,n})$ — the $r$-th instance of cluster $C_{i,n}$
- $dst(x_i, x_j)$ — distance between $x_i$ and $x_j$
- $nm(x_i, x_j)$ — number of common nearest neighbors of $x_i$ and $x_j$
- $d_{i,j}^{(n)}$ — distance in the $n$-th attribute between instances $x_i$ and $x_j$
- $\delta_{i,j}^{(n)}$ — missing-value indicator for the $n$-th attribute in instances $x_i$ and $x_j$
- $M$ — total number of instances
- $N$ — total number of attributes
A Simple Approach: Mean/Mode Replacement

One of the simplest approaches.

For categorical data: $x^{ms}_{i,n} = \operatorname{mode}_{p=1\ldots P_n} x_{p,n}$

For continuous data: $x^{ms}_{i,n} = \frac{1}{P_n}\sum_{p=1}^{P_n} x_{p,n}$

Advantages:
- Was the most commonly used method in practice (no longer recommended).
- Good MSE results (especially when one value dominates an attribute).

Limitations:
- Introduces bias by reducing the variance of the attribute.
- Poor accuracy in the presence of outliers and/or uniformly distributed attribute values.
- Does not use background information (e.g., the other attributes).
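A minimal sketch of mean/mode replacement (pandas-based; the function name and DataFrame layout are illustrative, not from the original slides):

```python
import pandas as pd

def mean_mode_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill each missing element with its column mean (continuous
    attributes) or column mode (categorical attributes)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())          # mean of non-missing values
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])  # most frequent value
    return out
```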
Replacement by Linear Regression
- Uses the non-missing data to predict the values of the missing data.
- All instances with the same values for their non-missing attributes receive identical estimates for their missing values.

Replacement by Predictive Mean Matching
- Imputes a value drawn randomly from the set of observed values whose predicted values are closest to the value predicted for the missing element by the simulated regression model (Heitjan and Little, 1991).
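A simplified sketch of predictive mean matching along the lines of the slide's description (plain least squares rather than a simulated/Bayesian parameter draw; all names are illustrative):

```python
import numpy as np

def pmm_impute(X_obs, y_obs, X_mis, k=5, seed=0):
    """Predictive mean matching: fit a linear model on complete rows,
    then impute each missing target with an observed value drawn from
    the k donors whose predictions are closest to that row's prediction."""
    rng = np.random.default_rng(seed)
    A_obs = np.column_stack([np.ones(len(X_obs)), X_obs])
    beta, *_ = np.linalg.lstsq(A_obs, y_obs, rcond=None)       # OLS fit on observed rows
    pred_obs = A_obs @ beta
    pred_mis = np.column_stack([np.ones(len(X_mis)), X_mis]) @ beta
    donors = [np.argsort(np.abs(pred_obs - p))[:k] for p in pred_mis]
    return np.array([y_obs[rng.choice(d)] for d in donors])    # random draw among donors
```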
Replacement by Multiple Imputation
- Generate multiple simulated values for each incomplete instance (by Monte Carlo methods).
- Analyze the dataset with each simulated value substituted in turn.
- Combine the results and report the average as the overall estimate.

Advantages:
- Theoretically well founded (Little and Rubin, 1987).
- Reliable estimates for continuous attributes when the fraction of missing values is fairly small (Schafer, 1999).
Example: 5 instances with 6 attributes and 1 missing value

    1  10.2  1  1  ms  1
    1   9.8  1  1   2  1
    0   1.1  0  0   1  0
    0   1.1  0  0   1  1
    1   0.3  0  0   1  0

Mode-based imputation yields ms = 1 (the mode of the observed values 2, 1, 1, 1).

Regression, predictive mean matching, and multiple imputation do not fare much better here, since none of them exploits the fact that the first two instances are nearly identical (ms = 2 is the reasonable estimate).

DCI basic idea: around each instance with a missing value, construct an independent cluster of similar instances that have no missing value in the particular attribute, and use this subset for imputation.
DCI Approach: Steps
1. Similarity matrix construction.
2. K shared nearest neighbors matrix construction.
3. Process each missing value.
4. Impute the missing value.
5. Recalculate.
DCI: Basics

1. Similarity matrix (SM) construction:

$$SM = \begin{pmatrix} 0 & dst(x_1, x_2) & \cdots & dst(x_1, x_M) \\ dst(x_2, x_1) & 0 & \cdots & dst(x_2, x_M) \\ \vdots & \vdots & \ddots & \vdots \\ dst(x_M, x_1) & dst(x_M, x_2) & \cdots & 0 \end{pmatrix}$$

where

$$dst(x_i, x_j) = \frac{\sum_{n=1}^{N} \delta^{(n)}_{i,j}\, d^{(n)}_{i,j}}{\sum_{n=1}^{N} \delta^{(n)}_{i,j}}, \qquad \delta^{(n)}_{i,j} = \begin{cases} 0, & \text{if one of } x_{i,n} \text{ or } x_{j,n} \text{ is missing;} \\ 1, & \text{otherwise.} \end{cases}$$
DCI: Basics

Per-attribute distance:
- for a continuous attribute $n$: $d^{(n)}_{i,j} = \dfrac{|x_{i,n} - x_{j,n}|}{\max_{p=1\ldots P_n} x_{p,n} - \min_{p=1\ldots P_n} x_{p,n}}$
- for a categorical attribute $n$: $d^{(n)}_{i,j} = \begin{cases} 1, & \text{if } x_{i,n} \neq x_{j,n}; \\ 0, & \text{if } x_{i,n} = x_{j,n}. \end{cases}$

2. K shared nearest neighbors matrix (NM) construction:

$$NM = \begin{pmatrix} 0 & nm_{1,2} & \cdots & nm_{1,M} \\ nm_{2,1} & 0 & \cdots & nm_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ nm_{M,1} & nm_{M,2} & \cdots & 0 \end{pmatrix}$$
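A sketch of the mixed-type distance and of the similarity matrix defined above (assuming `None`/NaN marks a missing value; the helper names are mine, not from the slides):

```python
import numpy as np

def is_missing(v):
    return v is None or (isinstance(v, float) and np.isnan(v))

def dst(xi, xj, ranges, is_cat):
    """dst(x_i, x_j): average per-attribute distance over the attributes
    observed in both instances (delta = 1); categorical mismatches count
    0/1, continuous differences are range-normalized."""
    num = den = 0.0
    for n, (a, b) in enumerate(zip(xi, xj)):
        if is_missing(a) or is_missing(b):
            continue  # delta^(n)_{i,j} = 0: skip this attribute
        num += (a != b) if is_cat[n] else abs(a - b) / ranges[n]
        den += 1.0
    return num / den if den else float("inf")

def similarity_matrix(X, ranges, is_cat):
    """SM[i, j] = dst(x_i, x_j); symmetric with a zero diagonal."""
    M = len(X)
    SM = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            SM[i, j] = SM[j, i] = dst(X[i], X[j], ranges, is_cat)
    return SM
```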
DCI: Basics

[Figure: example with K = 4 and M = 10, where the neighbor lists of $x_i$ and $x_j$ share exactly one instance, so $nm(x_i, x_j, K) = 1$.]

$$nm(x_i, x_j, K) = nm_{i,j} = |list_i \cap list_j|$$

NOTE: Both SM and NM are symmetric about the principal diagonal:

$$nm(x_i, x_j, K) = nm(x_j, x_i, K), \qquad dst(x_i, x_j) = dst(x_j, x_i)$$
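The shared-neighbor matrix can then be built directly from SM; a sketch (each instance's K-neighbor list is read off its SM row, skipping the zero self-distance):

```python
import numpy as np

def shared_neighbor_matrix(SM, K):
    """NM[i, j] = |list_i & list_j|: how many of the K nearest neighbors
    (by ascending distance in SM) the two instances have in common."""
    M = SM.shape[0]
    knn = [set(np.argsort(SM[i])[1:K + 1]) for i in range(M)]  # index 0 is the instance itself
    NM = np.zeros((M, M), dtype=int)
    for i in range(M):
        for j in range(i + 1, M):
            NM[i, j] = NM[j, i] = len(knn[i] & knn[j])  # NM is symmetric
    return NM
```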
DCI: Basics

3. Processing each missing value $x^{ms}_{i,n}$:
a) Form a list $list_{i,n}$ of all instances $x_p$, $p = 1, \ldots, P_n$, with no missing value in the attribute and with $nm_{i,p} > 0$.
b) Sort $list_{i,n}$ in ascending order of $dst(x^{ms}_i, x_p)$.
c) Form a cluster $C_{i,n}$ from the first $R$ elements of $list_{i,n}$.
d) Calculate an imputation from the cluster $C_{i,n}$.

4. Recalculate SM and NM. Options:
- after each imputation;
- after processing all missing values in a given attribute;
- no recalculation.
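A sketch of steps 3(a)-(c), forming the cluster for one missing element (reusing `is_missing` from the distance sketch above; the variable names are mine):

```python
def build_cluster(i, n, X, SM, NM, R):
    """C_{i,n}: candidates have attribute n observed and at least one
    shared nearest neighbor with x_i; keep the R closest by dst."""
    candidates = [p for p in range(len(X))
                  if p != i and not is_missing(X[p][n]) and NM[i, p] > 0]
    candidates.sort(key=lambda p: SM[i, p])  # ascending distance to x_i
    return candidates[:R]
```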
DCI: Methods for Data Imputation

Categorical attributes: take the value from the cluster member that shares the most nearest neighbors with $x_i$:

$$x^{ms}_{i,n} = x_{r,n}(C_{i,n}) \ \text{ such that } \ r = \arg\max_{r=1\ldots R} nm\big(x_i, x_r(C_{i,n})\big)$$

Continuous attributes: shared-neighbor-weighted average over the cluster:

$$x^{ms}_{i,n} = \frac{\sum_{r=1}^{R} nm\big(x_i, x_r(C_{i,n})\big)\, x_{r,n}(C_{i,n})}{\sum_{r=1}^{R} nm\big(x_i, x_r(C_{i,n})\big)}$$
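Both imputation rules in one sketch, reusing the cluster indices and NM from the previous steps:

```python
import numpy as np

def impute(i, n, cluster, X, NM, categorical):
    """Categorical: the value of the cluster member sharing the most
    nearest neighbors with x_i. Continuous: nm-weighted cluster mean."""
    if categorical:
        best = max(cluster, key=lambda r: NM[i, r])   # argmax of nm(x_i, x_r)
        return X[best][n]
    w = np.array([NM[i, r] for r in cluster], dtype=float)
    v = np.array([X[r][n] for r in cluster], dtype=float)
    return float(v @ w / w.sum())                     # nm-weighted average
```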
Evaluation: Test Data and Alternative Approaches

Adult database: 46,043 instances, 12 attributes (6 categorical, 6 continuous).

Test sets: 8 sets constructed by randomly hiding 0.2%-32.6% of the data elements.

Compared to:
- Multiple Imputation (Amelia II v.1.6).
- Linear Regression, Multilevel Regression, Predictive Mean Matching (WinMICE 1.8).
- Mean/Mode Replacement (WEKA 3.6).
- Random Replacement (MatLab 7.5).
Measures of Imputation Quality

Number of Incorrect Imputations (for categorical data):

$$NII = \frac{s}{Q} \cdot 100\%$$

Relative Imputation Accuracy (for continuous data):

$$RIA_\tau = \frac{n_\tau}{Q} \cdot 100\%$$

where $n_\tau$ is the number of imputed elements estimated within $\tau\%$ of the true values; $\tau$ is the tolerance percent; $Q$ is the total number of imputed values in the data; $s$ is the number of incorrectly estimated imputed elements.
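Both measures as code (a direct transcription of the two formulas, assuming the tolerance is relative to the true value; array names are illustrative):

```python
import numpy as np

def nii(imputed, truth):
    """Number of Incorrect Imputations: % of imputed categorical
    elements that differ from the true values (s / Q * 100%)."""
    imputed, truth = np.asarray(imputed), np.asarray(truth)
    return 100.0 * np.mean(imputed != truth)

def ria(imputed, truth, tau):
    """Relative Imputation Accuracy: % of imputed continuous elements
    within tau % of the true values (n_tau / Q * 100%)."""
    imputed, truth = np.asarray(imputed, float), np.asarray(truth, float)
    within = np.abs(imputed - truth) <= np.abs(truth) * tau / 100.0
    return 100.0 * np.mean(within)
```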
DCI: Influence of the Number of Common Nearest Neighbors (K) for Cluster Size R = 9

For imputation of categorical values:
- Large K is an appropriate choice for >5% missing values.
- Smaller K is recommended for a low fraction of missing values.
DCI: Influence of K for Continuous Data

For imputation of continuous values:
- Large K is an appropriate choice for 16.3% missing elements.
- In most cases low K is recommended.
DCI: Influence of Cluster Size R for Large K on Categorical Data Imputation

Imputation quality of categorical values is invariant to cluster size R for a large number of common nearest neighbors.
DCI: Influence of Cluster Size R for Large K on Imputation of Continuous Attributes

For imputation of continuous values:
- A small R value is recommended.
- For most fractions of missing data, the results vary within 3% of the absolute accuracy obtained for the same cluster size R.
Comparison to Alternative Imputation Methods: Categorical Attributes

For imputation of categorical values:
- DCI was much more accurate than the six alternative imputation methods.
- Most of the alternatives produced >50% more error on categorical attributes.
Comparison to Alternative Imputation Methods: Continuous Attributes

- DCI was much more accurate than the six alternative imputation methods.
- The mean/mode replacement method failed completely.
Imputation for Classification: Problem Description and Quality Measure

Classification task: based on 12 attributes, classify a person by average yearly income (more or less than 50,000 US dollars per year).

Data sets:
- Training set with no missing values: 16,043 instances.
- Test set with missing elements: 30,000 instances.

Classification models: NB-Trees; Random Subspace Selection; Multilayer Perceptron.

Quality measure (Classification Error):

$$CE = \frac{\text{falsely classified}}{\text{total instances}} \cdot 100\%$$
Evaluation of Different Imputation Methods for Classification: Classification Error

Conclusions:
- DCI is always better than the other approaches.
- Because the results are otherwise similar, DCI is most worthwhile with a high fraction of missing data.
Class Imbalance: Problem

Data set class distributions:
- Training set: 12,092 subjects in one class vs. 3,951 in the other (3:1).
- Test set: 22,529 subjects in one class vs. 7,471 in the other (3:1).

Confusion matrix (classifier H vs. etalon data E):

              E = 1   E = 0   total
    H = 1      TP      FP      HP
    H = 0      FN      TN      HN
    total      EP      EN       V

$$CE = \frac{FN + FP}{V} \cdot 100\%$$

CE does not account for how the confusion-matrix elements relate to each other.
Class Imbalance: Solution

F-score:

$$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} = \frac{2\,TP}{EP + HP}$$

Kappa statistic:

$$\kappa = \frac{\text{observed agreement} - \text{chance agreement}}{1 - \text{chance agreement}} = \frac{V\,(TP + TN) - HN \cdot EN - HP \cdot EP}{V^2 - HN \cdot EN - HP \cdot EP}$$
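Both statistics computed from the 2x2 confusion matrix in the slides' notation (a direct transcription; the F-score closed form follows from precision = TP/HP and recall = TP/EP):

```python
def f1_and_kappa(TP, FP, FN, TN):
    """F-score and kappa from the confusion matrix: HP/HN are the
    classifier's row totals, EP/EN the etalon column totals, V the grand total."""
    HP, HN = TP + FP, FN + TN
    EP, EN = TP + FN, FP + TN
    V = TP + FP + FN + TN
    f1 = 2.0 * TP / (EP + HP)  # = 2 * precision * recall / (precision + recall)
    kappa = (V * (TP + TN) - HN * EN - HP * EP) / (V * V - HN * EN - HP * EP)
    return f1, kappa
```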
F-score Based Evaluation of Different Imputation Methods for Classification

Conclusions:
- DCI was always better than the alternatives.
- DCI provided almost the same classification accuracy as using complete data.
Kappa-Statistic Based Evaluation of Different Imputation Methods for Classification

Conclusions:
- DCI was always better than the other approaches.
- The results were the same as for the F-score.
- The more robust kappa statistic showed a larger difference between DCI and the other methods.
Summary: DCI Features

Advantages:
- Deterministic algorithm.
- High quality of imputation in mixed-type data.
- Relative insensitivity to the fraction of missing elements in the data.

Limitations:
- The K and R parameters influence the imputation results.
- High computational complexity: O(M^3 log M).
- High memory consumption: O(M^2).
Thank you!

More information: http://www.ist.temple.edu
Contact: Zoran Obradovic, director
Information Science and Technology Center, Temple University
+1 215-204-6265
zoran@ist.temple.edu