Data Screening and Adjustments. Data Screening for Errors


Christiana Stanley
6 years ago

Purpose of data screening and adjustments:
- Detect and correct data errors
- Detect and treat missing data
- Detect and handle insufficiently sampled variables (e.g., rare species)
- Conduct transformations and standardizations
- Detect and handle outliers

Data Screening for Errors

- Examine summary statistics (e.g., n, mean, min, max) and check for irregularities, such as missing records ("Where did all the data go?") or unrealistic values.
- Action: correct errors in the raw data.
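The summary-statistics check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to produce the slides: the function name `summarize`, the use of `None` for missing records, and the percent-cover example are all hypothetical.

```python
import math

def summarize(column):
    """Return n, mean, min, and max for one variable, skipping missing values."""
    values = [v for v in column
              if v is not None and not (isinstance(v, float) and math.isnan(v))]
    n = len(values)
    if n == 0:
        return {"n": 0, "mean": None, "min": None, "max": None}
    return {"n": n, "mean": sum(values) / n, "min": min(values), "max": max(values)}

# Hypothetical variable: percent cover, which should lie between 0 and 100.
cover = [12.0, 35.5, None, 8.0, 350.0]
stats = summarize(cover)
print(stats)
```

Here n is smaller than the number of records (one value is missing) and the maximum exceeds 100, so both irregularities named on the slide show up in the summary.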

Data Screening for Missing Data

- Evaluate the amount and pattern of missing data and take corrective action if needed (e.g., median replacement).
- Action: replace with prior knowledge; insert means or medians; use regression to estimate values.

Data Screening for Sufficiency

- Check for and drop insufficient variables (e.g., rare species in community datasets).
- Sufficiency is the extent to which each variable (e.g., each species' ecological character) is accurately and meaningfully described by the data. For example, species with very few records are not likely to be accurately placed in ecological space.
- You must decide at what frequency of occurrence you are willing to accept the message, and eliminate species below that level.
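Median replacement, one of the corrective actions listed for missing data, might look like the following. It is a sketch only: it assumes missing values are coded as `None`, and the name `fill_with_median` is hypothetical.

```python
from statistics import median

def fill_with_median(column):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to estimate the median from
    m = median(observed)
    return [m if v is None else v for v in column]

filled = fill_with_median([4.0, None, 10.0, 6.0, None])
print(filled)
```

Replacing with the mean works the same way with `statistics.mean`; the median is less sensitive to extreme values.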

Data Screening for Sufficiency: Other Issues

- Influence of abundant generalists in community datasets: abundant generalists define strong dimensions of the data cloud that carry no meaningful pattern. They can overwhelm the message of rarer species in some types of analysis. You must decide whether to include or exclude these dominant species.
- Variables with too little variation (i.e., no signature): such variables have no meaningful pattern (or influence) and are therefore unnecessary.

[Figure: typical community dataset, showing % occurrence by species, with dominant species (9[?]% occurrence), median occurrence, and rare species marked.]

Data Screening for Sufficiency: Some Rules of Thumb

- Drop insufficient variables (species) and conduct a sensitivity analysis:
  - Rare species with too few occurrences (e.g., <[?]% occurrence)
  - Variables with too little variability (e.g., <[?]-[?]% CV)
- Drop abundant generalist species that are too ubiquitous and conduct a sensitivity analysis:
  - Dominant species (e.g., >9[?]% occurrence)
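The occurrence-based rules of thumb can be sketched as a column filter on a sample-by-species matrix. The function names and the cutoffs of 20% and 90% below are placeholders chosen for the toy data; they are not the thresholds from the slides.

```python
def percent_occurrence(matrix):
    """Percent of sample units (rows) in which each species (column) is present."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [100.0 * sum(1 for row in matrix if row[j] > 0) / n_rows
            for j in range(n_cols)]

def keep_species(matrix, min_pct, max_pct):
    """Indices of species that are neither too rare nor too ubiquitous."""
    occ = percent_occurrence(matrix)
    return [j for j, p in enumerate(occ) if min_pct <= p <= max_pct]

# Toy abundance matrix: 4 sample units (rows) by 4 species (columns).
abund = [
    [5, 0, 1, 0],
    [3, 0, 2, 0],
    [8, 1, 0, 0],
    [2, 0, 4, 0],
]
occ = percent_occurrence(abund)
kept = keep_species(abund, min_pct=20.0, max_pct=90.0)
print(occ, kept)
```

Rerunning the downstream analysis with several cutoffs is one simple way to carry out the sensitivity analysis the slide recommends.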

Data Transformations & Standardizations

Purpose:
- Statistical
  - Better meet assumptions of normality, linearity, homogeneity of variance, etc.
  - Make units of variables comparable when measured on different scales.
- Ecological
  - Make ecological distance measures work better.
  - Reduce the effect of total quantity in sample units, to put the focus on relative quantities.
  - Equalize (or otherwise alter) the relative importance of variables (e.g., common and rare species).
  - Emphasize informative variables (species) at the expense of uninformative variables (species).

Transformations are applied to each element of the data matrix, independent of the other elements. Example: the log transformation,

$$b_{ij} = \log(x_{ij} + 1)$$

Standardizations adjust matrix elements by a row or column standard (e.g., max, sum, etc.). Example: the column z-score standardization,

$$b_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
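The difference between the two operations shows up clearly in code: the log transformation touches each element on its own, while the z-score needs a column mean and standard deviation. The sketch below assumes a base-10 logarithm (the slides do not state the base) and the sample standard deviation; the function names are hypothetical.

```python
import math
from statistics import mean, stdev

def log_transform(matrix):
    """Element-wise b_ij = log10(x_ij + 1); base 10 is an assumption."""
    return [[math.log10(x + 1) for x in row] for row in matrix]

def column_zscores(matrix):
    """Column standardization b_ij = (x_ij - mean_j) / s_j.

    Uses the sample standard deviation; a constant column would divide by zero.
    """
    columns = list(zip(*matrix))
    means = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in matrix]

logged = log_transform([[0, 9, 99]])
z = column_zscores([[1, 10], [2, 20], [3, 30]])
print(logged, z)
```

After z-scoring, the two columns are identical even though the raw values differ tenfold, which is what "placing variables on equal footing" means in practice.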

Monotonic Transformations

When to transform?
- To adjust for highly skewed variables
- To better meet the assumptions of a statistical test (e.g., normality, constant variance, etc.)
- To emphasize the presence/absence (nonquantitative) signature

Which transformation?
- Depends on the type of data
- Whichever works best

Binary (Presence/Absence) Transformation

$$b_{ij} = x_{ij}^{0} \quad \text{(power)}$$

with $0^0$ taken as 0, so that zeros stay zero and every nonzero value becomes 1.

Acceptable domain of x: all. Range of f(x): 0 and 1 only.
- Converts quantitative data into nonquantitative data
- Applicable to species data
- Most useful when little quantitative information is present
- Can be a severe transformation
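As a small illustration of the binary transformation, the sketch below maps every positive abundance to 1 and everything else to 0. The function name is hypothetical, and treating only values greater than zero as "present" is an assumption suited to abundance data.

```python
def presence_absence(matrix):
    """Binary transformation: 1 where a species is present (x > 0), else 0."""
    return [[1 if x > 0 else 0 for x in row] for row in matrix]

binary = presence_absence([[0, 3, 12], [7, 0, 0]])
print(binary)
```

Note how 3 and 12 become indistinguishable, which is why the slide calls this a potentially severe transformation.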

Log Transformation

$$b_{ij} = \log(x_{ij} + 1)$$

Acceptable domain of x: > [?]. Range of f(x): all.
- Compresses high values and spreads low values by expressing values as orders of magnitude
- Useful when there is a high degree of variation, when the ratio of the largest to the smallest value is > [?], or when the data are highly positively skewed

Square Root Transformation

$$b_{ij} = x_{ij}^{1/2} \quad \text{(power)}$$

Acceptable domain of x: ≥ 0. Range of f(x): ≥ 0.
- Similar in effect to, but less dramatic than, the log transformation
- Often used with count (meristic) data, e.g., when the mean equals the variance (Poisson distribution)

Power Family Transformation

$$b_{ij} = x_{ij}^{\,[?]/p}$$

Acceptable domain of x: ≥ 0. Range of f(x): ≥ 0.
- Different exponents change the effect of the transformation; the smaller the exponent, the more compression is applied to high values
- Flexible transformation, useful for a wide variety of data

[Figure: power transformations, b plotted against x for five values of p.]
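The effect of the exponent can be seen with a short sketch. It assumes the exponent is passed directly (so 0.5 gives the square root); the parameter name `exponent` and the function name are hypothetical and do not reproduce the slide's notation in p.

```python
def power_transform(matrix, exponent):
    """Element-wise b_ij = x_ij ** exponent for non-negative data."""
    if any(x < 0 for row in matrix for x in row):
        raise ValueError("power transformation expects non-negative values")
    return [[x ** exponent for x in row] for row in matrix]

counts = [[0, 1, 16, 81]]
square_root = power_transform(counts, 0.5)   # square root transformation
fourth_root = power_transform(counts, 0.25)  # smaller exponent, more compression
print(square_root, fourth_root)
```

The largest count, 81, shrinks to about 9 under the square root and to about 3 under the fourth root, while 0 and 1 are unchanged: smaller exponents compress high values more.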

Arcsine Square Root Transformation

$$b_{ij} = \frac{2}{\pi} \sin^{-1}\!\left(x_{ij}^{1/2}\right)$$

Acceptable domain of x: 0-1. Range of f(x): 0-1.
- Spreads the ends of the scale while compressing the middle, for proportion data
- Useful for proportion data with positive skew (can use the arcsine transformation for negative skew)
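A minimal sketch of the arcsine square root transformation follows, assuming the input is a single proportion between 0 and 1. The function name is hypothetical.

```python
import math

def arcsine_sqrt(p):
    """b = (2 / pi) * asin(sqrt(p)) for a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("arcsine square root expects a proportion in [0, 1]")
    return (2.0 / math.pi) * math.asin(math.sqrt(p))

transformed = [arcsine_sqrt(p) for p in (0.0, 0.01, 0.5, 0.99, 1.0)]
print(transformed)
```

The factor 2/π rescales the result so that proportions of 0 and 1 map back to 0 and 1; values near either end are pulled apart, and 0.5 stays at 0.5.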

Monotonic Transformations: Some Rules of Thumb

- Use a log or square root transformation for data that are highly skewed or that range over several (> [?]) orders of magnitude.
- Use the arcsine square root transformation for proportion data.
- If applying a transformation to a related set of variables (e.g., species), use the same transformation (e.g., log) so that all are scaled the same; otherwise, transform each variable independently.
- Consider the binary (presence/absence) transformation when:
  - the percentage of zeros is high (say > [?]%)
  - the number of distinct values is low (say < [?])
  - beta diversity is high (say > [?])

Standardizations

When to standardize?
- To place highly unequal sample units or variables (species) on equal footing
- To better represent the patterns of interest

Which standardization?
- Depends on the objective (sample or variable adjustment) and the statistical technique (ordination, cluster analysis, etc.)
- Which standard (variance, totals, max, etc.) makes sense?

Standardizations adjust matrix elements by a row or column standard (e.g., max, sum, etc.). All standardizations can be applied to either rows or columns (or both), for example:

$$b_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \ \text{(column z-score)}, \qquad b_{ij} = \frac{x_{ij} - \bar{x}_i}{s_i} \ \text{(row z-score)}, \qquad b_{ij} = \frac{x_{ij}}{\max(x_i)} \ \text{(row max)}$$

Column or Row Standardizations?

Column standardization:
- When the principal concern is to adjust for differences (e.g., variances, total abundance, ubiquity) among variables (species) in order to place them on equal footing.
- When the focus is on the profile across sample units.

Row standardization:
- When the principal concern is to adjust for differences (e.g., total abundance, diversity) among sample units in order to place them on equal footing.
- When the focus is on the profile within a sample unit.

Common Standardizations

- Total: divide by margin total
- Max: divide by margin maximum
- Range: standardize values to range 0-1
- Frequency: divide by margin maximum and multiply by the number of non-zero items, so that the average of non-zero items is 1
- Hellinger: square root of method=total
- Normalization: make margin sums of squares equal to 1
- Standardize: scale to zero mean and unit variance (z-scores)
- Chi.square: divide by row sums and square root of column sums, and adjust for square root of matrix total

Column Z-score Standardization

$$b_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

Acceptable domain of x: all. Range of f(x): all.
- Converts data to z-scores (mean = 0, variance = 1)
- Commonly used to place variables on equal footing
- Essential when variables have different scales or units of measurement

Column Total Standardization

$$b_{ij} = \frac{x_{ij}}{\sum_{i} x_{ij}}$$

Acceptable domain of x: ≥ 0. Range of f(x): 0-1.
- Commonly used with species data to adjust for unequal abundances among species
- Equalizes areas under the curves of species response profiles
- Relative abundance profiles of samples depend on the species' relative abundances across all sites

Column Max Standardization

$$b_{ij} = \frac{x_{ij}}{\max(x_j)}$$

Acceptable domain of x: ≥ 0. Range of f(x): 0-1.
- Similar to column total, except that it:
  - Equalizes the heights of the peaks of species response curves
  - Is based on extreme values, which can introduce noise
  - Can exacerbate the importance of rare species

[Figure: species frequency plotted against abundance (count) in three panels. Column total standardization equalizes the area under each curve; column max standardization equalizes the peaks of the curves.]

Row Total Standardization

$$b_{ij} = \frac{x_{ij}}{\sum_{j} x_{ij}}$$

Acceptable domain of x: ≥ 0. Range of f(x): 0-1.
- Commonly used with species data to adjust for unequal abundances among sample units
- Equalizes areas under the curves of sample unit profiles
- Shifts emphasis to relative abundance within a sample unit
- Relative abundance profiles of samples are independent

Row Max Standardization

$$b_{ij} = \frac{x_{ij}}{\max(x_i)}$$

Acceptable domain of x: ≥ 0. Range of f(x): 0-1.
- Similar to row total, except that it:
  - Equalizes the heights of the peaks of sample unit profiles
  - Is based on extreme values, which can introduce noise
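The total and max standardizations differ only in the standard used and in the margin it is applied to, which the following sketch makes explicit. The function names are hypothetical; rows are taken to be sample units and columns species, and a row or column that sums to zero would cause a division by zero.

```python
def standardize_rows(matrix, standard=sum):
    """Divide each element by its row standard (sum for totals, max for maxima)."""
    out = []
    for row in matrix:
        s = standard(row)
        out.append([x / s for x in row])
    return out

def standardize_columns(matrix, standard=sum):
    """Same adjustment applied by column: transpose, standardize, transpose back."""
    columns = [list(col) for col in zip(*matrix)]
    scaled = standardize_rows(columns, standard)
    return [list(row) for row in zip(*scaled)]

abund = [[2, 6],
         [2, 2]]
row_total = standardize_rows(abund, sum)
col_total = standardize_columns(abund, sum)
col_max = standardize_columns(abund, max)
print(row_total, col_total, col_max)
```

With row totals each sample unit sums to 1, so the result depends only on that row; with column totals each species sums to 1, so a value depends on that species' abundance at every other site.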

Wisconsin Double Standardization

$$a_{ij} = \frac{x_{ij}}{\max(x_j)} \ \text{(column max)}, \qquad b_{ij} = \frac{a_{ij}}{\sum_{j} a_{ij}} \ \text{(row total)}$$

Acceptable domain of x: ≥ 0. Range of f(x): 0-1.
- First standardize by species (column) maxima, then by row totals
- Equalizes emphasis among sample units and among species
- Appealing, but comes at the cost of diminishing the intuitive meaning of individual data values

Standardizations: Some Rules of Thumb

- The effect of a standardization on the analysis depends on the variability among rows and/or columns.
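The two steps of the Wisconsin double standardization described above (column maxima first, then row totals) can be sketched as follows. The function name is hypothetical, and the code assumes every species occurs at least once and every sample unit is non-empty.

```python
def wisconsin_double(matrix):
    """Standardize by column (species) maxima, then by row (sample unit) totals."""
    col_max = [max(col) for col in zip(*matrix)]
    by_max = [[x / m for x, m in zip(row, col_max)] for row in matrix]
    result = []
    for row in by_max:
        total = sum(row)
        result.append([x / total for x in row])
    return result

wis = wisconsin_double([[2, 6],
                        [2, 2]])
print(wis)
```

Every row of the result sums to 1, but the individual values no longer read as counts or simple proportions, which is the loss of intuitive meaning the slide mentions.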

Standardizations: Some Rules of Thumb (continued)

- Consider row standardizations for species data sets, commonly:
  - Row normalize (Euclidean distance (ED) = chord distance) (Legendre and Gallagher 2001)
  - Row chi.square (ED = chi-square distance)
  - Row total (ED = species profile distance)
  - Row hellinger (ED = Hellinger distance)
- Consider column standardizations to equalize variables measured in different units and scales, commonly:
  - Column standardize (z-scores = zero mean and unit variance)
  - Column normalize (uncentered with unit variance)
  - Column total (column sums = 1)
  - Column range (column range 0-1)
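The statement that Euclidean distance on row-standardized data equals a named ecological distance can be checked numerically. The sketch below covers the row normalize and row Hellinger cases; the function names are hypothetical, and rows are assumed to contain at least one non-zero value.

```python
import math

def row_normalize(matrix):
    """Scale each row to unit length (row sum of squares equal to 1)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row])
    return out

def row_hellinger(matrix):
    """Square root of the row-total standardization."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([math.sqrt(x / total) for x in row])
    return out

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

abund = [[3, 4], [30, 40], [4, 3]]
normed = row_normalize(abund)
same_profile = euclidean(normed[0], normed[1])  # chord distance, same proportions
diff_profile = euclidean(normed[0], normed[2])  # chord distance, different proportions
hell = row_hellinger(abund)
print(same_profile, diff_profile, euclidean(hell[0], hell[2]))
```

The first two sample units differ tenfold in total abundance but have identical proportions, so their chord distance is zero: only relative composition contributes.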

Standardizations: Some Rules of Thumb (continued)

- Standardizations may not matter, depending on the subsequent analysis, e.g.:
  - Principal components analysis of a correlation matrix has a built-in column standardization
  - Correspondence analysis of a species data set has essentially a built-in chi-square standardization
- There is no theoretical basis for selecting the best standardization; justify the choice on biological grounds and perhaps conduct a sensitivity analysis.

Data Screening for Outliers

- What are outliers?
  - Sample units with extreme values for individual variables (univariate outliers), or sample units with an unusual combination of values for more than one variable (multivariate outliers).
- Why worry about outliers?
  - Outliers can have a large effect on the outcome of an analysis and can therefore lead to erroneous conclusions.

Data Screening for Outliers (continued)

- Univariate outliers:
  - Examine sample standard deviation scores on each variable separately. Samples with standard deviation scores > [?] are extreme observations.
- Multivariate outliers:
  - Examine deviations of each sample's average distance to the other samples. Samples with standard deviation scores > [?] are extreme observations.

[Table: standard deviation scores for each sample on each variable, with extreme observations flagged; values not recoverable.]
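Both screening steps reduce to computing standard deviation scores and flagging the large ones. The sketch below is illustrative only: the thresholds of 2.5 and 1.5 are placeholders suited to the small toy data (not the cutoff from the slides), Euclidean distance is an assumption, and all names are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math
from statistics import mean, stdev

def sd_scores(values):
    """Standard deviation (z) scores based on the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def flag_extremes(values, threshold):
    """Indices whose absolute standard deviation score exceeds the threshold."""
    return [i for i, z in enumerate(sd_scores(values)) if abs(z) > threshold]

def average_distances(matrix):
    """Average Euclidean distance from each sample to all other samples."""
    n = len(matrix)
    return [sum(math.dist(matrix[i], matrix[k]) for k in range(n) if k != i) / (n - 1)
            for i in range(n)]

# Univariate screen on one variable.
counts = [10, 11, 9, 10, 12, 10, 11, 9, 10, 60]
univariate = flag_extremes(counts, threshold=2.5)

# Multivariate screen on average distance to the other samples.
samples = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
multivariate = flag_extremes(average_distances(samples), threshold=1.5)
print(univariate, multivariate)
```

In small samples the largest possible standard deviation score is (n - 1) / sqrt(n), so a cutoff chosen for large data sets may never be exceeded when n is small.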

Data Screening for Outliers (continued)

- Multivariate outliers:
  - Examine each sample's Mahalanobis distance to the group of remaining samples.
  - Examine the results of subsequent analyses for extreme values (e.g., isolated points in ordination plots, single-member clusters in cluster analysis, etc.).
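The Mahalanobis check can be sketched for the simplest case of two variables, where the covariance matrix can be inverted by hand. This is a restricted illustration, not a general implementation: the function names are hypothetical, only two variables are supported, and the remaining samples must not be perfectly collinear.

```python
import math
from statistics import mean

def mahalanobis_2d(point, others):
    """Mahalanobis distance from a 2-variable point to the group `others`."""
    xs = [p[0] for p in others]
    ys = [p[1] for p in others]
    mx, my = mean(xs), mean(ys)
    n = len(others)
    # Sample variances and covariance of the remaining samples.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in others) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d' S^-1 d using the closed-form 2x2 inverse.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

def leave_one_out_mahalanobis(samples):
    """Distance of each sample to the group formed by all remaining samples."""
    return [mahalanobis_2d(samples[i], samples[:i] + samples[i + 1:])
            for i in range(len(samples))]

samples = [(0, 0), (2, 0), (0, 2), (2, 2), (3, 1)]
distances = leave_one_out_mahalanobis(samples)
print(distances)
```

Leaving each sample out of its own reference group keeps an extreme sample from inflating the covariance it is being judged against.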

Data Screening for Outliers: Some Rules of Thumb

- Examine data at all stages of analysis (i.e., input data, transformed/standardized data, ecological distance matrix, results of analysis) for extreme values.
- Be aware of the potential impact of extreme values in the chosen analysis.
- Delete extreme values only if justifiable on ecological grounds.
- Conduct a sensitivity analysis.
