
The Data Cleaning Problem: Some Key Issues & Practical Approaches

Ronald K. Pearson
Daniel Baugh Institute for Functional Genomics and Computational Biology
Department of Pathology, Anatomy, and Cell Biology
Thomas Jefferson University, Philadelphia, PA

DIMACS Workshop on Data Quality, Data Cleaning and Treatment of Noisy Data
November 3-4, 2003

Topics

1. Outliers: an important data anomaly
   - types and working assumptions
   - some real data examples
2. Detecting outliers
   - the popular 3σ edit rule
   - order statistics vs. moments
   - some alternative approaches
3. Other data anomalies
   - missing data
   - misalignments
   - noninformative variables
   - comparing performance

Example 1: Outlier in a microarray data sequence

[Figure: dye-swap averages of log2 intensity ratios for gene 263, plotted against sample number for the Control and EtOH conditions, with a single marked outlier.]

Example 2: Influence of outliers on a volcano plot

[Figure: volcano plot of log2 expression change against t-test p-value for genes 201 to 300.]

Example 3: Bivariate outlier in a simulated dataset

NOTE: the outlier is not extreme with respect to either x or y individually.

[Figure: scatterplot of y(k) against x(k) on [0, 1] × [0, 1], with the bivariate outlier marked.]

Example 4: Bivariate outliers in a real dataset

[Figure: MA plot (log2 intensity ratio against log2 geometric mean intensity) constructed from an uncorrected microarray dataset.]

Example 5: Multivariate outliers in material property relationships

NOTE: here, the outliers correspond to unusually good materials.

[Figure: selectivity plotted against permeability, with the 1980 and 1991 performance limits and a group of points labeled "FTNs" marked.]

Example 6: Common mode outlier example

NOTE: univariate outliers can be highly correlated in different variables.

[Figure: four panels plotting property value Q(k) against time k: raw and cleaned inlet data (top), raw and cleaned outlet data (bottom); the outliers in the raw inlet and outlet sequences occur at the same times.]

The 3σ Edit Rule

Procedure: declare x_k an outlier if

    |x_k − x̄| > 3σ̂

Motivation:
1. the Gaussian assumption x_k ~ N(µ, σ²) is very popular
2. under this assumption, Prob{|x_k − µ| > 3σ} ≈ 0.3%

History:
- dates back at least a century: T. Wright, A Treatise on the Adjustment of Observations by the Method of Least Squares, Van Nostrand, 1884
- still advocated today: S. Draghici, Data Analysis Tools for DNA Microarrays, Chapman and Hall/CRC, 2003
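For concreteness, here is a minimal sketch of this rule in Python with NumPy (the function name and the exposed threshold t are illustrative choices, not from the talk):

```python
import numpy as np

def three_sigma_outliers(x, t=3.0):
    """Flag x[k] as an outlier when |x[k] - mean(x)| > t * std(x)."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) > t * x.std()

# One gross error in a moderately long, well-behaved sequence is caught:
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100)
x[10] = 25.0
print(np.flatnonzero(three_sigma_outliers(x)))   # should print [10]
```

Even this favorable case depends on the contamination being rare; the next slides show how quickly the rule degrades.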

A Spectacular Failure: The flow rate dataset

NOTE: this dataset contains 20% visually obvious outliers; none are detected by the 3σ edit rule.

[Figure: flow rate R(k) plotted against hour k for roughly 2,500 hourly samples.]

Why?

Basic reason: the mean µ and standard deviation σ are unknown and must be estimated from the data, and the standard estimators are extremely sensitive to the presence of outliers.

Specific observation: at point contamination levels greater than 10%, the 3σ edit rule fails completely: no outliers are detected.

To overcome this problem:
1. replace the mean with an outlier-resistant alternative (e.g., the median)
2. replace the standard deviation with an outlier-resistant alternative (e.g., the MAD scale estimate)
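The failure mode is easy to reproduce in a few lines; the sketch below (contamination level and values invented for illustration) shows the classical estimates inflating until nothing is flagged, while median/MAD replacements recover the contamination:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)
x[:200] = 100.0                      # 20% point contamination

# Classical estimates are dragged toward the contaminant...
print(round(x.mean(), 1), round(x.std(), 1))        # roughly 20 and 40
# ...so the 3-sigma edit rule detects nothing:
print(np.sum(np.abs(x - x.mean()) > 3 * x.std()))   # 0

# Outlier-resistant replacements recover essentially all 200 bad points:
med = np.median(x)
s = 1.4826 * np.median(np.abs(x - med))             # MAD scale estimate
print(np.sum(np.abs(x - med) > 3 * s))              # approximately 200
```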

Outlier Sensitivity of Standard Moment Estimators

[Figure: estimation errors for the mean (1st moment), variance (2nd moment), and normalized skewness (3rd moment) at contamination levels of 0%, 1%, 5%, and 15%.]

Order-based Alternatives

[Figure: estimation errors for the median, the square of the interquartile distance, and Galton's skewness measure at contamination levels of 0%, 1%, 5%, and 15%.]

The Hampel Identifier

Idea:
- replace the mean x̄ with the outlier-resistant median x̃
- replace the standard deviation σ̂ with the outlier-resistant MAD scale estimate S

The MAD scale estimate:

    S = 1.4826 · median{|x_k − x̃|}

Interpretation:
- d_k = |x_k − x̃| measures the distance of each point x_k from the reference value x̃
- the median d_k value tells how far a typical point lies from x̃
- the factor 1.4826 makes S an unbiased estimate of σ for Gaussian data
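Assembled directly from the definitions above, a minimal Python/NumPy sketch (the function name is mine; t = 3 mirrors the 3σ rule):

```python
import numpy as np

def hampel_identifier(x, t=3.0):
    """Flag x[k] when |x[k] - median(x)| exceeds t times the MAD scale S."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)                         # outlier-resistant location
    s = 1.4826 * np.median(np.abs(x - med))    # outlier-resistant scale (MAD)
    return np.abs(x - med) > t * s
```

One caveat worth knowing: S collapses to zero whenever more than half the data values are identical, in which case every other point is flagged; this implosion is the flip side of the estimator's robustness.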

The Flow Rate Dataset Revisited

The Hampel identifier provides a clean separation between normal operation and shutdown episodes.

[Figure: flow rate R(k) plotted against hour k, as in the previous flow-rate plot.]

The Boxplot Edit Rule

Symmetric version: like the Hampel identifier, but
- replace x̄ with x̃
- replace σ̂ with the outlier-resistant interquartile distance Q

Quartiles:
- x_U = upper quartile: 75% of the data values lie below this observation
- x_L = lower quartile: 25% of the data values lie below this observation
- Q = x_U − x_L

Asymmetric version:

    x_k < x_L − tQ  ⇒  lower outlier
    x_k > x_U + tQ  ⇒  upper outlier
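A sketch of the asymmetric version, again in Python/NumPy; the slide leaves the threshold t open, so the conventional boxplot value t = 1.5 is used here as an illustrative default:

```python
import numpy as np

def boxplot_outliers(x, t=1.5):
    """Asymmetric boxplot edit rule: flag points beyond the quartiles -/+ t*Q."""
    x = np.asarray(x, dtype=float)
    x_l, x_u = np.percentile(x, [25, 75])   # lower and upper quartiles
    q = x_u - x_l                           # interquartile distance Q
    return (x < x_l - t * q) | (x > x_u + t * q)
```

Because the lower and upper limits float independently rather than being centered on a single reference value, the rule can flag very different numbers of low and high outliers, which is what makes it attractive for asymmetric data like the pressure example below.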

Asymmetric Example: The industrial pressure dataset

Comparison of three outlier detection rules.

[Figure: three panels applying the three-sigma rule, the Hampel identifier, and the asymmetric boxplot rule to the same industrial pressure sequence.]

Missing Data

Problem: some x_k values are unavailable
- ignorable case: increases variability
- nonignorable case: introduces bias (the 1936 Literary Digest election poll)

Autocorrelation example (see the sketch below):

    R_xx(k) = (1/|S|) Σ_{l ∈ S} x_l x_{l+k},  S = a subset of {1, 2, ..., N}

- ignorable case (S = random subset): causes increased variability of the R_xx(k) estimates
- nonignorable case (S = even indices only): R_xx(k) cannot be estimated for any odd k

Additional consequences:
- missing values can be converted into outliers (storage tank example)
- missing values can cause misalignments
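A hedged sketch of the autocorrelation example (the function and variable names are mine): average the products x_l * x_{l+k} over only those index pairs where both values are observed, and the even-index pathology appears immediately:

```python
import numpy as np

def rxx(x, k, observed):
    """Lag-k product average over index pairs (l, l+k) that are both observed."""
    pairs = [l for l in observed if l + k in observed]
    if not pairs:
        raise ValueError(f"no observed pairs at lag {k}")
    return np.mean([x[l] * x[l + k] for l in pairs])

x = np.random.default_rng(0).normal(size=100)
evens = set(range(0, 100, 2))
print(rxx(x, 2, evens))   # fine: even lags pair even indices with even indices
print(rxx(x, 1, evens))   # raises: no odd lag is estimable at all
```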

Misalignment: Four corrupted data sequences caused by unexpected blank records

NOTE: the difficulty of detection varies strongly from one variable to another.

[Figure: four panels plotting gene index, array row, X location, and log2 intensity against record number for records 1000 to 3500.]

The CAMDA Challenge Dataset

CAMDA: Critical Assessment of Microarray Data Analysis, an annual data analysis competition.

CAMDA 2002 challenge datasets:
1. Latin square Affymetrix benchmark
2. normal mouse cDNA microarray study

Structure of the normal mouse dataset:
- derived from 72 individual microarrays: 3 organ samples from each of 6 mice, 4 microarrays per sample
- 2 channels per microarray: reference & experimental
- reformatted into three organ-specific summary datasets

The CAMDA Challenge Dataset

Stivers et al. obtained anomalous results from a preliminary principal components analysis:
- expected clustering: a common reference cluster plus 3 organ clusters
- observed: unreasonable splitting of the reference cluster
- subsequently observed: disagreements in gene ID/slide position combinations between the different organ datasets

What happened? 1932 of the 5304 genes were mis-annotated; the cause was an error in the procedure that combined the 72 individual microarray datasets into the 3 organ-specific summary datasets.

Software Errors

Source of both misalignment examples:
1. inconsistent handling of missing values between Excel and S-plus
2. (Stivers et al.): "The data used here were assembled into packages, probably manually using ad hoc database, spreadsheet, or perl script. Under these conditions, it is remarkably easy for the row order to be changed accidentally..."

Some relevant observations:
1. Wall et al. (2000): "It is a standing joke in the Perl community that the next big stock market crash will probably be caused by a bug in someone's Perl script."
2. Kaner et al. (1999): about one in three attempts to fix a program doesn't work or causes a new problem.
3. Beizer (1990): estimates between 1 and 3 errors per 100 executable statements, after the code has been debugged.

Noninformative Variables

Externally noninformative variables:
- variables x_k that are a priori irrelevant
- Murphy's law: irrelevant variables sometimes aren't (R.W. McClure's example)

Inherently noninformative variables (screened for in the sketch below):
- completely missing variables
- constant variables
- exact duplicate variables

Application-irrelevant variables:
- e.g., variables that become inherently noninformative when the analysis is restricted to a subset of interest
- specific example: anomaly indicator variables in the analysis of nominal data
- (sometimes:) noise variables

Why is this important?
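The inherently noninformative cases are mechanical enough to screen for automatically; here is a minimal sketch using pandas (the function name and return convention are mine):

```python
import pandas as pd

def inherently_noninformative(df: pd.DataFrame) -> dict:
    """Flag completely missing, constant, and exact-duplicate columns."""
    flagged = {}
    for col in df.columns:
        if df[col].isna().all():
            flagged[col] = "completely missing"
        elif df[col].dropna().nunique() <= 1:
            flagged[col] = "constant"
    # Exact duplicates: any column identical to an earlier one.
    for col in df.columns[df.T.duplicated()]:
        flagged.setdefault(col, "exact duplicate of an earlier column")
    return flagged
```

The transpose-based duplicate check copies the whole frame, so it is a convenience for modest datasets rather than a scalable solution; the externally noninformative and application-irrelevant cases, by contrast, require domain knowledge and cannot be automated this way.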

A Clustering Example

Eight datasets compared:
- k = 4 well-separated clusters
- three informative components in each attribute vector x_k
- 0 to 7 noninformative components in x_k

Clustering procedure:
- Partitioning Around Medoids (PAM), Kaufman and Rousseeuw (1987)
- a better-behaved alternative to k-means

Performance assessment:
- average silhouette coefficient (Kaufman and Rousseeuw, 1987)
- assesses both intracluster cohesion and intercluster separation
- bounded between −1 (horrible misclassification) and +1 (perfect classification)
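A rough reconstruction of this experiment in Python (cluster geometry, noise scale, and sample sizes are invented; scikit-learn has no PAM implementation, so k-means plus the silhouette coefficient stands in for it here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Four well-separated clusters in three informative dimensions.
centers = np.array([[0, 0, 0], [5, 0, 0], [0, 5, 0], [0, 0, 5]], dtype=float)
X_info = np.vstack([c + rng.normal(0.0, 0.3, (50, 3)) for c in centers])

for n_noise in range(8):   # 0 to 7 noninformative components
    noise = rng.uniform(-5.0, 5.0, (len(X_info), n_noise))
    X = np.hstack([X_info, noise])
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(n_noise, round(silhouette_score(X, labels), 3))
```

As in the results table on the next slide, the average silhouette coefficient should degrade steadily as noninformative components are added, even though the informative cluster structure never changes.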

Clustering Results: Influence of noninformative variables

Average silhouette coefficients s̄ (k = 2 yields a spurious clustering; k = 4 the correct clustering):

Noise components   s̄, k = 2   s̄, k = 4
       0             0.636      0.750
       1             0.619      0.709
       2             0.604      0.675
       3             0.587      0.638
       4             0.579      0.619
       5             0.568      0.595
       6             0.557      0.573
       7             0.548      0.555

A Final Example

Consider the effects of small deletions:
- datasets: four different 17-point sequences
- deletions: all possible 2-point deletions, giving C(17, 2) = 136 possible 15-point subsets

The data sequences:
0: uniformly distributed on [−1.1, 1.1]
1: 8 points uniformly distributed on [−1.1, −0.9], one zero value, 8 points uniformly distributed on [0.9, 1.1]
2: the middle 5 points of Sequence 0 set to zero (one common missing-data model)
3: Sequence 0 with 2 outliers, rescaled into the original [−1.1, 1.1] range

The scale estimates compared (see the sketch below):
A. the standard deviation σ̂
B. the interquartile distance Q
C. the MAD scale estimate S
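A sketch of the deletion experiment itself (names are mine; one of the four sequences is simulated as described above):

```python
import numpy as np
from itertools import combinations

def scale_estimates_under_deletions(x):
    """SD, IQD, and MAD scale estimates for every 2-point deletion of x."""
    x = np.asarray(x, dtype=float)
    out = []
    for i, j in combinations(range(len(x)), 2):   # C(17, 2) = 136 subsets
        sub = np.delete(x, [i, j])
        q25, q75 = np.percentile(sub, [25, 75])
        med = np.median(sub)
        out.append((sub.std(ddof=1),                          # SD
                    q75 - q25,                                # IQD
                    1.4826 * np.median(np.abs(sub - med))))   # MAD
    return np.array(out)

# Sequence 2: a uniform sequence with its middle 5 points set to zero.
rng = np.random.default_rng(0)
seq = rng.uniform(-1.1, 1.1, 17)
seq[6:11] = 0.0
est = scale_estimates_under_deletions(seq)
print(est.min(axis=0))   # smallest SD, IQD, MAD over the 136 subsets
print(est.max(axis=0))   # largest SD, IQD, MAD over the 136 subsets
```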

Four Simulated Data Sequences

[Figure: four panels plotting data value x(k) against sample number k for Sequences 0 through 3.]

Scale Estimates: Consequences of all possible 2-point deletions

[Figure: for each of Sequences 0 through 3, the spread of the standard deviation (SD), interquartile distance (IQD), and MAD scale estimates over the 136 two-point deletions.]

Summary: Three Key Conclusions

1. Unimaginable anomalies infest real datasets.
   - Yogi Berra: "If something has a 50% chance of happening, then 9 times out of 10 it will."
   - Dasu and Johnson (2003, p. 186): "Take NOTHING for granted. The data are never what they are supposed to be, even after they are cleaned up. The schemas, layout, content, and nature of content are never completely known or documented and continue to change dynamically."
2. Different analysis methods exhibit different sensitivities to different data anomalies.
3. Comparison of what should be equivalent analyses across different scenarios can be extremely useful in uncovering anomalies.