Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data


Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data
Vadim Ayuyev, Joseph Jupin, Philip Harris and Zoran Obradovic
Temple University, Philadelphia, USA, 2009

Real Life Data is Often Missing Some Values

Outline

Notation

Missing values are assumed to be missing completely at random.

- $x_i$ - the i-th instance
- $x_i^{ms}$ - the i-th instance that has a missing value in the attribute being processed
- $x_{i,n}$ - the n-th attribute of $x_i$
- $x_{i,n}^{ms}$ - the n-th (missing) attribute of $x_i$
- $P_n$ - number of non-missing values in the n-th attribute
- $C_{i,n}$ - cluster constructed for $x_{i,n}^{ms}$
- $x_r^{(C_{i,n})}$ - the r-th instance of cluster $C_{i,n}$
- $dst(x_i, x_j)$ - distance between instances $x_i$ and $x_j$
- $nm_{i,j}$ - number of common nearest neighbors of $x_i$ and $x_j$
- $d_{i,j}^{(n)}$ - distance in the n-th attribute between instances $x_i$ and $x_j$
- $\delta_{i,j}^{(n)}$ - missing-value indicator for the n-th attribute in instances $x_i$ and $x_j$
- $M$ - total number of instances
- $N$ - total number of attributes

A Simple Approach: Mean/Mode Replacement

For categorical data: $x_{i,n}^{ms} = \operatorname{mode}_{p = 1 \dots P_n} x_{p,n}$

For continuous data: $x_{i,n}^{ms} = \frac{1}{P_n} \sum_{p=1}^{P_n} x_{p,n}$

Advantages:
- One of the simplest approaches.
- Was the most commonly used in practice (no longer recommended).
- Good MSE results (especially when one of the values dominates in a given attribute).

Limitations:
- Introduces bias by reducing the variance of the attribute.
- Worst accuracy in the presence of outliers and/or a uniform distribution of values in the attribute.
- Does not use background information (e.g., other attributes).
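
A minimal Python sketch of mean/mode replacement for a single attribute (the function name, the use of `None` to mark missing entries, and the `numpy`/`collections` calls are illustrative choices, not part of the slides):

```python
import numpy as np
from collections import Counter

def mean_mode_impute(column, categorical):
    """Replace None entries with the column mode (categorical) or mean (continuous)."""
    observed = [v for v in column if v is not None]
    if categorical:
        fill = Counter(observed).most_common(1)[0][0]   # mode over non-missing values
    else:
        fill = float(np.mean(observed))                 # mean over non-missing values
    return [fill if v is None else v for v in column]

# The slide's criticisms show up directly: the imputed value ignores all other attributes.
print(mean_mode_impute([1, 2, None, 1, 1], categorical=True))            # [1, 2, 1, 1, 1]
print(mean_mode_impute([10.2, 9.8, None, 1.1, 0.3], categorical=False))  # fills in 5.35
```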

Replacement by Linear Regression
- Uses the non-missing data to predict the values of the missing data.
- All instances with the same values for the non-missing attributes receive identical estimates for their missing values.

Replacement by Predictive Mean Matching
- Imputes a value drawn randomly from the set of observed values whose predicted values are closest to the predicted value of the missing entry under the fitted regression model (Heitjan and Little, 1991).
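
A hedged sketch of predictive mean matching on one continuous target attribute, assuming an ordinary least squares fit on the complete cases; the names and the single-closest-donor rule are simplifications of mine (the method as described draws randomly from a pool of closest donors):

```python
import numpy as np

def pmm_impute(X_obs, y_obs, X_mis):
    """Fit y ~ X on complete cases, then impute each missing y by borrowing the
    observed value whose prediction is closest to the incomplete instance's
    prediction. Plain regression imputation would return y_hat_mis directly."""
    A_obs = np.column_stack([np.ones(len(X_obs)), X_obs])    # add intercept column
    beta, *_ = np.linalg.lstsq(A_obs, y_obs, rcond=None)     # least-squares fit
    y_hat_obs = A_obs @ beta
    A_mis = np.column_stack([np.ones(len(X_mis)), X_mis])
    y_hat_mis = A_mis @ beta
    donors = np.abs(y_hat_obs[None, :] - y_hat_mis[:, None]).argmin(axis=1)
    return y_obs[donors]                                     # observed values, not predictions

X_obs = np.array([[1.0], [2.0], [3.0], [4.0]])
y_obs = np.array([2.1, 3.9, 6.2, 8.1])
print(pmm_impute(X_obs, y_obs, np.array([[2.5]])))           # borrows an observed y (3.9)
```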

Replacement by Multiple Imputation
- Generate multiple simulated values for each incomplete instance (by Monte Carlo methods).
- Analyze the dataset with each simulated value substituted in turn.
- Combine the results and report the average as the overall estimate.

Advantages:
- Theoretically well founded (Little and Rubin, 1987).
- Reliable estimates for continuous attributes and a fairly small fraction of missing values (Schafer, 1999).
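
The combine step can be made concrete with a small sketch; averaging the point estimates is what the slide describes, while the variance pooling follows the standard Rubin's-rules form and is my addition rather than something stated on the slide:

```python
import numpy as np

def pool_estimates(point_estimates, within_variances):
    """Combine analyses from m imputed datasets: average the per-dataset point
    estimates (the 'overall estimate') and pool within/between-imputation variance."""
    q = np.asarray(point_estimates, dtype=float)
    u = np.asarray(within_variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                        # pooled point estimate
    w = u.mean()                            # average within-imputation variance
    b = q.var(ddof=1)                       # between-imputation variance
    return q_bar, w + (1.0 + 1.0 / m) * b   # estimate and its total variance

# e.g. an attribute mean computed on three differently imputed copies of the data
print(pool_estimates([10.4, 10.1, 10.6], [0.20, 0.22, 0.19]))
```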

Example: 5 instances with 6 attributes and 1 missing value

    1   10.2   1   1   ms   1
    1    9.8   1   1   2    1
    0    1.1   0   0   1    0
    0    1.1   0   0   1    1
    1    0.3   0   0   1    0

Mode-based imputation gives ms = 1. Regression, predictive mean matching and multiple imputation are not much different here, as they do not exploit the fact that the first two instances are so similar (ms = 2 is the reasonable estimate).

DCI basic idea: around each instance with a missing value, construct an independent cluster of similar instances that have no missing value in the particular attribute, and use this subset for imputation.

DCI Approach: Steps
- Similarity matrix construction
- K shared nearest neighbors matrix construction
- Process missing values
- Impute missing values
- Recalculate

DCI: Basics

1. Similarity matrix (SM) construction:

$$SM = \begin{pmatrix} 0 & dst(x_1, x_2) & \cdots & dst(x_1, x_M) \\ dst(x_2, x_1) & 0 & \cdots & dst(x_2, x_M) \\ \vdots & \vdots & \ddots & \vdots \\ dst(x_M, x_1) & dst(x_M, x_2) & \cdots & 0 \end{pmatrix}$$

where

$$dst(x_i, x_j) = \frac{\sum_{n=1}^{N} \delta_{i,j}^{(n)} d_{i,j}^{(n)}}{\sum_{n=1}^{N} \delta_{i,j}^{(n)}}, \qquad \delta_{i,j}^{(n)} = \begin{cases} 0, & \text{if one of } x_{i,n} \text{ or } x_{j,n} \text{ is missing;} \\ 1, & \text{otherwise.} \end{cases}$$

DCI: Basics

For a continuous attribute n:

$$d_{i,j}^{(n)} = \frac{\left| x_{i,n} - x_{j,n} \right|}{\max_{p = 1 \dots P_n} x_{p,n} - \min_{p = 1 \dots P_n} x_{p,n}}$$

For a categorical attribute n:

$$d_{i,j}^{(n)} = \begin{cases} 1, & \text{if } x_{i,n} \neq x_{j,n}; \\ 0, & \text{if } x_{i,n} = x_{j,n}. \end{cases}$$

2. K shared nearest neighbors matrix (NM) construction:

$$NM = \begin{pmatrix} 0 & nm_{1,2} & \cdots & nm_{1,M} \\ nm_{2,1} & 0 & \cdots & nm_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ nm_{M,1} & nm_{M,2} & \cdots & 0 \end{pmatrix}$$
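
A combined sketch of the SM construction from the formulas above: range-normalized distances for continuous attributes, 0/1 mismatch for categorical ones, and the delta mask that skips attributes missing in either instance. Encoding categorical values as integers and missing entries as `np.nan` is my convention, not the slides':

```python
import numpy as np

def similarity_matrix(X, categorical):
    """Compute SM for mixed-type data.
    X: (M, N) float array; categorical attributes are integer-coded and missing
    entries are np.nan. categorical: length-N sequence of booleans."""
    X = np.asarray(X, dtype=float)
    cat = np.asarray(categorical, dtype=bool)
    M, N = X.shape
    miss = np.isnan(X)
    # range of each continuous attribute, taken over its non-missing values
    rng = np.ones(N)
    for n in np.where(~cat)[0]:
        col = X[~miss[:, n], n]
        if col.size:
            rng[n] = max(col.max() - col.min(), 1e-12)
    SM = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            usable = ~(miss[i] | miss[j])                  # the delta_{i,j}^{(n)} mask
            if not usable.any():
                continue
            d = np.where(cat,
                         (X[i] != X[j]).astype(float),     # categorical: 0/1 mismatch
                         np.abs(X[i] - X[j]) / rng)        # continuous: range-normalized
            SM[i, j] = SM[j, i] = d[usable].mean()         # average over usable attributes
    return SM
```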

DCI: Basics

Example: K = 4, M = 10, nm(x_i, x_j, K) = 1.

$$nm(x_i, x_j, K) = nm_{i,j} = \left| list_i \cap list_j \right|$$

where $list_i$ and $list_j$ are the lists of the K nearest neighbors of $x_i$ and $x_j$.

NOTE: Both SM and NM are symmetric about the principal diagonal:

$$nm(x_i, x_j, K) = nm(x_j, x_i, K), \qquad dst(x_i, x_j) = dst(x_j, x_i)$$
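
A sketch of the NM construction under the intersection-of-neighbor-lists reading above: each instance keeps its K nearest neighbors by SM distance, and nm_{i,j} counts how many neighbors the two lists share (excluding the instances themselves, which is an assumption on my part):

```python
import numpy as np

def shared_neighbor_matrix(SM, K):
    """SM: (M, M) symmetric distance matrix. Returns NM with
    NM[i, j] = |list_i intersect list_j|, the shared K-nearest-neighbor count."""
    M = SM.shape[0]
    lists = []
    for i in range(M):
        order = np.argsort(SM[i])
        order = order[order != i][:K]          # K nearest neighbors, excluding i itself
        lists.append(set(order.tolist()))
    NM = np.zeros((M, M), dtype=int)
    for i in range(M):
        for j in range(i + 1, M):
            NM[i, j] = NM[j, i] = len(lists[i] & lists[j])
    return NM
```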

DCI: Basics

3. Processing each missing value $x_{i,n}^{ms}$:
a) Form a list $list_{i,n}$ of all instances $x_p$, $p = 1, \dots, P_n$, that have no missing value in the attribute and satisfy $nm_{i,p} > 0$.
b) Sort $list_{i,n}$ in ascending order of $dst(x_i^{ms}, x_p)$.
c) Form a cluster $C_{i,n}$ from the first R elements of $list_{i,n}$.
d) Calculate an imputation from the cluster $C_{i,n}$.

4. Recalculate SM and NM. Options: after each imputation; after processing all missing values in a certain attribute; no recalculation.

DCI: Methods for Data Imputation

Categorical attributes:

$$x_{i,n}^{ms} = x_{r,n}^{(C_{i,n})} \ \text{ such that } \ nm\!\left(x_i, x_r^{(C_{i,n})}\right) = \max_{r = 1 \dots R} nm\!\left(x_i, x_r^{(C_{i,n})}\right)$$

Continuous attributes:

$$x_{i,n}^{ms} = \frac{\sum_{r=1}^{R} nm\!\left(x_i, x_r^{(C_{i,n})}\right) x_{r,n}^{(C_{i,n})}}{\sum_{r=1}^{R} nm\!\left(x_i, x_r^{(C_{i,n})}\right)}$$
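
Putting step 3 and the two imputation rules together for a single missing entry: the cluster is the R candidates closest to x_i that have a value in the attribute and share at least one neighbor with it, and the weights are the shared-neighbor counts. Tie-breaking for the categorical rule is unspecified on the slides, so the first maximum is taken here:

```python
import numpy as np

def impute_one(X, SM, NM, i, n, R, categorical_n):
    """Impute the missing entry X[i, n] from the DCI cluster C_{i,n}."""
    # (a) candidates with a value in attribute n and nm_{i,p} > 0
    candidates = [p for p in range(X.shape[0])
                  if p != i and not np.isnan(X[p, n]) and NM[i, p] > 0]
    candidates.sort(key=lambda p: SM[i, p])          # (b) ascending distance to x_i
    cluster = candidates[:R]                         # (c) first R elements
    if not cluster:
        return np.nan                                # no usable neighbors
    if categorical_n:
        # value of the cluster member that shares the most neighbors with x_i
        best = max(cluster, key=lambda p: NM[i, p])
        return X[best, n]
    weights = np.array([NM[i, p] for p in cluster], dtype=float)
    values = np.array([X[p, n] for p in cluster], dtype=float)
    return float(weights @ values / weights.sum())   # nm-weighted average
```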

Evaluation: Test Data and Alternative Approaches

Adult database: 46,043 instances, 12 attributes (6 categorical, 6 continuous).

Test sets: 8 sets constructed by randomly hiding 0.2%-32.6% of the data elements.

Compared to:
- Multiple Imputation (Amelia II v.1.6).
- Linear Regression, Multilevel Regression, Predictive Mean Matching (WinMICE 1.8).
- Mean/Mode Replacement (WEKA 3.6).
- Random Replacement (MATLAB 7.5).

Measures of Imputation Quality

Number of Incorrect Imputations (for categorical data):

$$NII = \frac{s}{Q} \cdot 100\%$$

Relative Imputation Accuracy (for continuous data):

$$RIA_{\tau} = \frac{n_{\tau}}{Q} \cdot 100\%$$

where
- $n_{\tau}$ - number of imputed elements estimated within $\tau$% of the true values;
- $\tau$ - tolerance percent;
- $Q$ - total number of imputed values in the data;
- $s$ - number of incorrectly estimated imputed elements.
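
Both measures translate directly into code; reading "within τ% of the true value" as a relative-error threshold is my interpretation, and the array names are illustrative:

```python
import numpy as np

def nii(true_cat, imputed_cat):
    """Number of Incorrect Imputations (percent) for categorical data."""
    true_cat, imputed_cat = np.asarray(true_cat), np.asarray(imputed_cat)
    return 100.0 * np.sum(true_cat != imputed_cat) / len(true_cat)

def ria(true_num, imputed_num, tau):
    """Relative Imputation Accuracy (percent): share of imputed values that fall
    within tau percent of the true value."""
    t = np.asarray(true_num, dtype=float)
    y = np.asarray(imputed_num, dtype=float)
    within = np.abs(y - t) <= (tau / 100.0) * np.abs(t)
    return 100.0 * np.sum(within) / len(t)

print(nii(['a', 'b', 'b'], ['a', 'a', 'b']))                   # 33.33...
print(ria([10.0, 20.0, 30.0], [10.5, 23.0, 29.0], tau=10))     # 66.66...
```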

DCI: Influence of the Number of Common Nearest Neighbors (K) for Cluster Size R = 9

For imputation of categorical values:
- Large K is an appropriate choice when more than 5% of values are missing.
- Smaller K is recommended for a low fraction of missing values.

DCI: Influence of K for Continuous Data

For imputation of continuous values:
- Large K is an appropriate choice at 16.3% missing elements.
- In most cases, low K is recommended.

DCI: Influence of Cluster Size R for Large K on Categorical Data Imputation

The imputation quality of categorical values is invariant to the cluster size R when the number of common nearest neighbors is large.

DCI: Influence of Cluster Size R for Large K on Imputation of Continuous Attributes

For imputation of continuous variables:
- A small value of R is recommended.
- For most fractions of missing data, the results vary within 3% of the absolute accuracy obtained for the same cluster size R.

Comparison to Alternative Imputation Methods: Categorical Attributes

For imputation of categorical values:
- DCI was much more accurate than the six alternative imputation methods.
- Most of the alternatives produced more than 50% more error on categorical attributes.

Comparison to Alternative Imputation Methods: Continuous Attributes

- DCI was much more accurate than the six alternative imputation methods.
- The mean/mode replacement method failed completely.

Imputation for Classification: Problem Description and Quality Measure

Classification task: based on 12 attributes, classify a person by annual income (more or less than 50,000 US dollars per year).

Data sets:
- Training set with no missing values: 16,043 instances.
- Test set with missing elements: 30,000 instances.

Classification models: NB-Trees; Random Subspace Selection; Multilayer Perceptron.

Quality measure (Classification Error):

$$CE = \frac{\text{misclassified instances}}{\text{total instances}} \cdot 100\%$$

Evaluation of Different Imputation Methods for Classification: Classification Error

Conclusions:
- DCI is always better than the other approaches.
- Because the methods give similar results when little data is missing, DCI is most worthwhile with a high fraction of missing data.

Class Imbalance: Problem

Data set composition:
- Training set: 12,092 subjects in one class vs. 3,951 in the other (about 3:1).
- Test set: 22,529 subjects in one class vs. 7,471 in the other (about 3:1).

Confusion matrix:

                        Etalon data (E)
                        1       0
    Classifier (H)  1   TP      FP      HP
                    0   FN      TN      HN
                        EP      EN      V

$$CE = \frac{FN + FP}{V} \cdot 100\%$$

CE does not account for how the confusion matrix elements relate to each other.

Class Imbalance: Solution

F-score (using the confusion matrix above):

$$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} = \frac{2 \, TP}{EP + HP}$$

Kappa statistic:

$$\kappa = \frac{\text{observed agreement} - \text{chance agreement}}{1 - \text{chance agreement}} = \frac{V (TP + TN) - HN \cdot EN - HP \cdot EP}{V^2 - HN \cdot EN - HP \cdot EP}$$
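
All three measures used in this evaluation follow from the confusion-matrix counts; a small sketch for checking values by hand (the input is the four raw counts, everything else is derived):

```python
def classification_measures(TP, FP, FN, TN):
    """CE, F1 and Cohen's kappa from the 2x2 confusion matrix of the slides."""
    V = TP + FP + FN + TN
    HP, HN = TP + FP, FN + TN            # classifier positives / negatives
    EP, EN = TP + FN, FP + TN            # etalon (true) positives / negatives
    ce = 100.0 * (FN + FP) / V
    f1 = 2.0 * TP / (EP + HP)
    observed = (TP + TN) / V
    chance = (HP * EP + HN * EN) / (V * V)
    kappa = (observed - chance) / (1.0 - chance)
    return ce, f1, kappa

print(classification_measures(TP=60, FP=10, FN=20, TN=110))   # (15.0, 0.8, ~0.68)
```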

F-score Based Evaluation of Different Imputation Methods for Classification

Conclusions:
- DCI was always better than the alternatives.
- DCI provided almost the same classification accuracy as using the complete data.

Kappa-statistic Based Evaluation of Different Imputation Methods for Classification

Conclusions:
- DCI was always better than the other approaches.
- The results agree with the F-score evaluation.
- The more robust kappa statistic showed a larger difference between DCI and the other methods.

Summary: DCI Features

Advantages:
- Deterministic algorithm.
- High quality of imputation in mixed-type data.
- Relative insensitivity to the fraction of missing elements in the data.

Limitations:
- The K and R parameters influence the imputation results.
- High computational complexity: O(M^3 log M).
- High memory consumption: O(M^2).

Thank you!

More information: http://www.ist.temple.edu

Contact: Zoran Obradovic, director
Information Science and Technology Center, Temple University
+1 215-204-6265, zoran@ist.temple.edu