Martin L. Lesser, PhD Biostatistics Unit Feinstein Institute for Medical Research North Shore-LIJ Health System

PREP Course #10: Introduction to Exploratory Data Analysis and Data Transformations (Part 1)

CME Disclosure Statement
The North Shore-LIJ Health System adheres to the ACCME's new Standards for Commercial Support. Any individuals in a position to control the content of a CME activity, including faculty, planners, and managers, are required to disclose all financial relationships with commercial interests. All identified potential conflicts of interest are thoroughly vetted by North Shore-LIJ for fair balance and scientific objectivity and to ensure appropriateness of patient care recommendations.
Course Director and Course Planners Kevin Tracey, MD, Cynthia Hahn, Emmelyn Kim, MPH, and Tina Chuck, MPH have nothing to disclose.
Martin L. Lesser, PhD, EMT-CC has nothing to disclose.

Quick Review
Measures of location: mean, median, quartiles, quantiles
Measures of spread: range, standard deviation, interquartile range, interquantile range
Quick displays of data: stem-and-leaf plot, box (and whisker) plot

LOS for 54 Pneumonia Patients (Hypothetical Example)
Observed data (days), n = 54:
8 8 4 11 6 8 5 14 10 16 4 5 12 8 3 7 14 9 6 6 5 8 34 6 10 22 9 9 7 5 8 8 10 4 8 10 17 5 13 15 4 12 12 10 6 3 3 8 9 8 26 28 30 30
Frequency Distribution
                                    Cumulative  Cumulative
LOS          Frequency   Percent    Frequency   Percent
3-4 days         6        11.11         6        11.11
5-6 days        10        18.52        16        29.63
7-8 days        12        22.22        28        51.85
9-10 days        9        16.67        37        68.52
11-12 days       4         7.41        41        75.93
13-14 days       4         7.41        45        83.33
15-16 days       2         3.70        47        87.04
17-18 days       1         1.85        48        88.89
21-22 days       1         1.85        49        90.74
25-26 days       1         1.85        50        92.59
27-28 days       1         1.85        51        94.44
29-30 days       2         3.70        53        98.15
33-34 days       1         1.85        54       100.00
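The frequency distribution above can be rebuilt with a short script. This is a minimal sketch, not the software used for the slide: the function name `frequency_table` and its `width` and `start` parameters are illustrative assumptions, and the data are taken from the ordered listing on the median slide, which agrees with the table.

```python
from collections import Counter

# Ordered LOS values (days) for the 54 hypothetical pneumonia patients.
los = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]

def frequency_table(values, width=2, start=3):
    """Return (label, freq, pct, cum_freq, cum_pct) for each non-empty bin."""
    n = len(values)
    bins = Counter((v - start) // width for v in values)
    rows, cum = [], 0
    for b in sorted(bins):
        lo = start + b * width
        cum += bins[b]
        rows.append((f"{lo}-{lo + width - 1} days", bins[b],
                     round(100 * bins[b] / n, 2), cum, round(100 * cum / n, 2)))
    return rows

for row in frequency_table(los):
    print("{:<11}{:>4}{:>8.2f}{:>4}{:>8.2f}".format(*row))
```

Empty intervals (for example 19-20 days) are skipped, as in the table on the slide.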

SUMMARIZING DATA
Graphical Methods: Histograms, Stem-and-leaf plots, Box plots
Measures of Location: Mean, Median, Quartiles
Measures of Spread: Range (R), Mean absolute deviation (MAD), Variance (s^2), Standard deviation (s or SD), Interquartile range (IQR)

LOS for 54 Pneumonia Patients: Frequency Histogram

LOS for 54 Pneumonia Patients: Relative Frequency Histogram

Stem-and-Leaf Plot: LOS for 54 Pneumonia Patients
Stem Leaf            #  Boxplot
  34 0               1     *
  32
  30 00              2     0
  28 0               1     0
  26 0               1     0
  24
  22 0               1     0
  20
  18
  16 00              2
  14 000             3
  12 00000           5  +-----+
  10 000000          6     +
   8 00000000000000 14  *-----*
   6 0000000         7  +-----+
   4 000000000       9
   2 00              2
     ----+----+----+----+

Constructing a Stem-and-Leaf Plot
Stem Leaf
  34
  32
  30
  28
  26
  24
  22
  20
  18
  16
  14
  12
  10 0    <=== Step 4: represents 4th data point, 11.0; and so on
   8 00   <=== Steps 1 and 2: represent 1st and 2nd data points, 8.0 and 8.0
   6
   4 0    <=== Step 3: represents 3rd data point, 4.0
   2
Continue to fill in the plot until all data points have been plotted. Note that the data do not have to be entered in sorted order.
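The construction steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the program that produced the slide: the function name `stem_and_leaf` is hypothetical, stems are 2 days apart as in the plot, and every observation is drawn as a `0` leaf to match the display.

```python
from collections import Counter

# Ordered LOS values (days); the order of entry does not matter.
los = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]

def stem_and_leaf(values, width=2):
    """Return display lines from the highest stem down, one '0' leaf per value."""
    counts = Counter(width * (v // width) for v in values)
    top, bottom = max(counts), min(counts)
    lines = []
    for stem in range(top, bottom - 1, -width):
        k = counts.get(stem, 0)
        # Empty stems are kept so that gaps in the data stay visible.
        lines.append(f"{stem:>4} {'0' * k:<14} {k if k else ''}".rstrip())
    return lines

print("Stem Leaf            #")
print("\n".join(stem_and_leaf(los)))
```

Each value is assigned to the stem at or just below it (8 and 9 both go on stem 8), so the leaf counts match the `#` column of the plot.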

How Many Stem Lines? What Interval Between Stems?
Maximum number of stem lines: L = [10 x log10(n)], where [x] = greatest integer function
Example: n = 54, L = [10 x log10(54)] = 18
L for various values of n:
n    20   50   100   150   200   300
L    13   17    20    22    24    25
Interval size = range / L, rounded to the nearest power of 10
Example: n = 54, L = 18, range = 34 - 2 = 32, interval size = 32/18 = 1.8, rounded to 1


Computing the Mean
Suppose there are n observations: $X_1, X_2, \ldots, X_n$
Mean = $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
FACTS:
The mean measures the central tendency of the data.
The mean is sensitive to extreme observations known as outliers.
Observed data (days), n = 54:
8 8 4 11 6 8 5 14 10 16 4 5 12 8 3 7 14 9 6 6 5 8 34 6 10 22 9 9 7 5 8 8 10 4 8 10 17 5 13 15 4 12 12 10 6 3 3 8 9 8 26 28 30 30
$\bar{X}$ = mean = 576 / 54 = 10.7 days

Computing the Median
The median is the middle value that splits the data set into two equal parts.
To compute the median (M), arrange the $X_i$ in ascending order: $X_{(1)}, X_{(2)}, X_{(3)}, \ldots, X_{(n)}$, where $X_{(1)}$ = smallest value, $X_{(2)}$ = 2nd smallest value, ..., $X_{(n)}$ = largest value.
The median is defined as the middle observation, which corresponds to the ordered observation in position (n + 1) / 2 (the "depth").
If n is an odd number, the median falls precisely on the middle observation, $X_{((n+1)/2)}$.
If n is an even number, the median falls halfway between the two middle observations, $X_{(n/2)}$ and $X_{(n/2+1)}$. In other words, median = $(X_{(n/2)} + X_{(n/2+1)}) / 2$.
The median is said to be robust because it is not sensitive to outliers.

Computing the Median (continued)
Ordered data:
3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 10 10 10 10 10 11 12 12 12 13 13 14 14 15 16 17 22 26 28 30 30 34
n = 54
Since n is even, M is the average of the middle two observations:
depth of median = (n + 1) / 2 = 55 / 2 = 27.5 => M = average of obs #27 and #28 = 8 days
If n is odd, M is simply the middle observation, the one at depth (n + 1) / 2.
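A short sketch of the mean and the depth-based median for the LOS data follows. The helper name `median_by_depth` is an assumption made for illustration, and the value 340 is a made-up outlier used only to show how differently the two summaries react.

```python
import math

# Ordered LOS values (days) from the slide above.
los = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]

def median_by_depth(values):
    """Return (depth, median) using the depth (n + 1) / 2 of the ordered data."""
    xs = sorted(values)
    depth = (len(xs) + 1) / 2
    lower = xs[math.floor(depth) - 1]   # depths are 1-based
    upper = xs[math.ceil(depth) - 1]    # equals lower when the depth is whole
    return depth, (lower + upper) / 2

mean = sum(los) / len(los)
depth, median = median_by_depth(los)
print(f"mean = {mean:.1f} days, depth = {depth}, median = {median} days")

# Replace the longest stay (34) with a hypothetical 340-day stay.
with_outlier = sorted(los[:-1] + [340])
mean_out = sum(with_outlier) / len(with_outlier)
_, median_out = median_by_depth(with_outlier)
print(f"with outlier: mean = {mean_out:.1f} days, median = {median_out} days")
```

The mean moves from about 10.7 to about 16.3 days while the median stays at 8 days, which is the sense in which the median is robust.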

Computing the Lower and Upper Quartiles ("Hinges")
The quartiles split the set of data into four equal parts.
Lower quartile Q1 = median of lower half, at depth (n + 1) / 4
Upper quartile Q3 = median of upper half, at depth 3 * (n + 1) / 4
Facts:
The quartiles split the sample into quarters.
Half of the observations lie between Q1 and Q3.
The quartiles are said to be robust because they are not sensitive to outliers.
There are several different methods for computing quartiles.
To compute the quartiles, refer to the ordered data:
depth of Q1 = (total obs + 1) / 4 = (54 + 1) / 4 = 13.75 => Q1 = average of obs #13 and #14 = 6 days
depth of Q3 = 3 * (total obs + 1) / 4 = 3 * (54 + 1) / 4 = 41.25 => Q3 = average of obs #41 and #42 = 12.5 days
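The quartile rule used on this slide (average the two observations that bracket a fractional depth) can be sketched as below. The helper `value_at_depth` is a hypothetical name. As the slide notes, other methods exist: linear interpolation at depth 41.25, for instance, would give 12.25 rather than 12.5 for Q3.

```python
import math

# Ordered LOS values (days) from the median slide.
los = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]

def value_at_depth(sorted_xs, depth):
    """Average the observations on either side of a (possibly fractional) depth."""
    lower = sorted_xs[math.floor(depth) - 1]   # depths are 1-based
    upper = sorted_xs[math.ceil(depth) - 1]
    return (lower + upper) / 2

n = len(los)
depth_q1 = (n + 1) / 4          # 13.75
depth_q3 = 3 * (n + 1) / 4      # 41.25
q1 = value_at_depth(los, depth_q1)
q3 = value_at_depth(los, depth_q3)
print(f"Q1 = {q1} days (depth {depth_q1}), Q3 = {q3} days (depth {depth_q3})")
```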

A measure of location, alone, does not adequately describe a set of data!!

Same Location, Different Spread

Computing Measures of Spread
Suppose there are n observations, $X_1, X_2, \ldots, X_n$
Range = $X_{\max} - X_{\min}$
Mean absolute deviation = MAD = $\frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|$
Variance = $s^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$
Standard deviation = SD = $s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}$
Interquartile range = IQR = $Q_3 - Q_1$
FACTS:
The range, MAD, variance, SD and IQR all measure the amount of variation (spread) in the data.
All measures except the MAD and IQR are sensitive to extreme observations known as outliers.
MAD and IQR are robust measures of spread.

Summary for LOS Example
Observed data (days), n = 54:
8 8 4 11 6 8 5 14 10 16 4 5 12 8 3 7 14 9 6 6 5 8 34 6 10 22 9 9 7 5 8 8 10 4 8 10 17 5 13 15 4 12 12 10 6 3 3 8 9 8 26 28 30 30
Location                 Spread
mean = 10.7 days         R = 31 days
M = 8 days               SD = 7.2 days
Q1 = 6 days              MAD = 5.1 days
Q3 = 12.5 days           IQR = 6.5 days
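The spread values in the summary can be checked with the sketch below. It uses the ordered listing from the median slide and the hinges Q1 = 6 and Q3 = 12.5 reported above. The SD is computed with both divisors as an illustration: for these data the reported 7.2 days is reproduced by the n - 1 divisor, while dividing by n gives about 7.1. Variable names are illustrative.

```python
import math

# Ordered LOS values (days) from the median slide.
los = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]

n = len(los)
mean = sum(los) / n

value_range = max(los) - min(los)
mad = sum(abs(x - mean) for x in los) / n          # mean absolute deviation
sum_sq = sum((x - mean) ** 2 for x in los)         # sum of squared deviations
sd_n = math.sqrt(sum_sq / n)                       # divisor n
sd_n_minus_1 = math.sqrt(sum_sq / (n - 1))         # divisor n - 1
iqr = 12.5 - 6.0                                   # Q3 - Q1 from the quartile slide

print(f"R = {value_range}, MAD = {mad:.1f}, IQR = {iqr}")
print(f"SD (divisor n) = {sd_n:.2f}, SD (divisor n - 1) = {sd_n_minus_1:.2f}")
```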

The Boxplot
The boxplot is a convenient way of depicting the distribution of data using measures of location and spread.
The most important parts of a boxplot correspond to the lower and upper quartiles, the median, and the mean.
Sometimes known as a box-and-whisker plot.

Anatomy of a Boxplot (top to bottom)
Inner fence: Q3 + 1.5 x IQR
Q3 (top of box)
Median (line inside box)
Mean (marked +)
Q1 (bottom of box)
Inner fence: Q1 - 1.5 x IQR
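As a sketch of how the inner fences are applied, the code below computes them for the LOS example from the hinges reported earlier (Q1 = 6, Q3 = 12.5) and lists the observations that fall outside. Variable names are illustrative. Other quartile conventions shift the fences slightly, so a borderline stay such as 22 days may be classified differently by other software.

```python
# Ordered LOS values (days) from the median slide.
los = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]

q1, q3 = 6.0, 12.5            # hinges from the quartile slide
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # inner fence below the box
upper_fence = q3 + 1.5 * iqr  # inner fence above the box

# Observations beyond the inner fences are plotted individually.
outside = [x for x in los if x < lower_fence or x > upper_fence]

print(f"inner fences: {lower_fence} to {upper_fence} days")
print("outside values:", outside)
```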

Schematic Plots: Side-by-Side Boxplots of LOS for Four Nursing Stations
(Vertical axis: LOS, 0 to 25 days. Horizontal axis: Nursing Station: 1 West, 2 South, 3 North, 4 East.)

Salary Levels, by Gender

REFERENCES
Hoaglin DC, Mosteller F, Tukey JW. Understanding Robust and Exploratory Data Analysis. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., 1983.
Chambers JM, Cleveland WS, Kleiner B, Tukey PA. Graphical Methods for Data Analysis. Wadsworth Statistics/Probability Series. Duxbury Press, 1983.
Mosteller F, Tukey JW. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods, 1977.
Velleman PF, Hoaglin DC. Applications, Basics, and Computing of Exploratory Data Analysis (A-B-Cs of EDA). PWS Publishers, Duxbury Press, 1981.

Introduction to Exploratory Data Analysis (EDA) Data Transformations Part 1

Why Transform* Data?
1. Classical Inference
   a. To achieve homoscedasticity (standard ANOVA and t-tests assume equal variances)
   b. To achieve normality
   c. To straighten out plots
   d. To conform to known physical laws
2. Exploratory Data Analysis (EDA)
   a. To symmetrize/normalize
   b. To explore data
   c. To compare distributions
   d. To linearize plots
   e. To create confusion (??)
* EDAers use the word "re-express"

Displaying Data Using a Stem-and-Leaf Plot
LOS for 54 Pneumonia Patients (Hypothetical Example)
Observed data (days), n = 54:
8 8 4 11 6 8 5 14 10 16 4 5 12 8 3 7 14 9 6 6 5 8 34 6 10 22 9 9 7 5 8 8 10 4 8 10 17 5 13 15 4 12 12 10 6 3 3 8 9 8 26 28 30 30

Displaying Data (The EDA Way)
1. Stem-and-Leaf Displays (Organize Data)
2. Letter-Value Displays (Summarize Data)
Example 1: Bilirubin of 95 Patients Who Underwent the Whipple Procedure, 1940-1980, With Pathological Dx of Cancer
14.5   Pancreas
13.1   Pancreas
 8.1   Bile Duct
31.3   Ampulla
12.6   Pancreas
 4.2   Other
22.2   Bile Duct
...    ...

Whipple Procedure: Bilirubin of 95 Patients
Stem Leaf             #
  31 3                1
  30
  29
  28 2                1
  27 7                1
  26 03               2
  25
  24 012              3
  23
  22 223              3
  21 0                1
  20 0028             4
  19
  18 02               2
  17 8                1
  16 168              3
  15 0                1
  14 356              3
  13 001446           6
  12 126679           6
  11 122245           6
  10 789              3
   9 001689           6
   8 1                1
   7 334              3
   6 0145788          7
   5 017              3
   4 2559             4
   3 134              3
   2 0389             4
   1 02               2
   0 233333444566688 15
     ----+----+----+----+

Example 2: Zinc levels in patients with Epidermoid Cancer of the head and neck
Patients with stable nutritional status (n = 25)
Stem Leaf             #
  11 9                1
  10 145679           6
   9 01455555         8
   8 01236889         8
   7 89               2
     ----+----+----+----+
Multiply Stem.Leaf by 10**+1
Patients with impaired nutritional status (n = 25)
Stem Leaf             #
   8 01               2
   7 22233444477889  14
   6 15568            5
   5 167              3
   4 5                1
     ----+----+----+----+
Multiply Stem.Leaf by 10**+1

Letter-Value Displays
Letter values and their depths ([x] = greatest integer function):
Extremes (1)     d(1) = 1
Sixteenths (D)   d(D) = ([d(E)] + 1) / 2
Eighths (E)      d(E) = ([d(H)] + 1) / 2
Hinges (H)       d(H) = ([d(M)] + 1) / 2
Median (M)       d(M) = (n + 1) / 2
Mid-Summaries:
mid 1    Mid-range = (min + max) / 2
mid D    Mid-sixteenth = (D_L + D_U) / 2
mid E    Mid-eighth = (E_L + E_U) / 2
mid H    Mid-hinge = (H_L + H_U) / 2
med      Median
Spreads:
1 spread   range = max - min
D spread   = D_U - D_L
E spread   = E_U - E_L
H spread   Interquartile range = H_U - H_L

Letter-Value Displays for the Examples
Bilirubin (n=95)
     DEPTH   LOWER   UPPER     MID   SPREAD
M    48         9.9            9.9
H    24.5     3.8    14.4      9.1    10.6
E    12.5     0.6    20.9     10.75   20.3
D     6.5     0.35   24.5     12.43   24.15
1     1       0.2    31.3     15.75   31.1
Zinc-Stable (n=25)
     DEPTH   LOWER   UPPER     MID   SPREAD
M    13         94            94
H     7      86     101       93.5    15
E     4      81     106       93.5    25
D     2.5    79.5   108       93.75   28.5
1     1      78     119       98.5    41
Zinc-Impaired (n=25)
     DEPTH   LOWER   UPPER     MID   SPREAD
M    13         73            73
H     7      65      77       71      12
E     4      57      78       67.5    21
D     2.5    53.5    79.5     66.5    26
1     1      45      81       63      36
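The depth rules and the Zinc-Stable display can be reproduced with a short sketch. The 25 values are read off the stem-and-leaf plot for the stable group (stem.leaf times 10). The function name `letter_values` is an illustrative assumption, and a letter value at a half depth is taken as the average of the two neighbouring observations.

```python
import math

# Zinc levels, stable nutritional status, decoded from the stem-and-leaf plot.
zinc_stable = [78, 79, 80, 81, 82, 83, 86, 88, 88, 89, 90, 91, 94,
               95, 95, 95, 95, 95, 101, 104, 105, 106, 107, 109, 119]

def letter_values(values, letters="MHED"):
    """Return rows (letter, depth, lower, upper, mid, spread), extremes last."""
    xs = sorted(values)
    n = len(xs)
    rows = []
    depth = (n + 1) / 2                          # depth of the median
    for letter in letters:
        lo_i, hi_i = math.floor(depth), math.ceil(depth)
        lower = (xs[lo_i - 1] + xs[hi_i - 1]) / 2    # counted from the bottom
        upper = (xs[n - lo_i] + xs[n - hi_i]) / 2    # counted from the top
        rows.append((letter, depth, lower, upper, (lower + upper) / 2, upper - lower))
        depth = (math.floor(depth) + 1) / 2      # next depth: ([d] + 1) / 2
    rows.append(("1", 1, xs[0], xs[-1], (xs[0] + xs[-1]) / 2, xs[-1] - xs[0]))
    return rows

print("     DEPTH  LOWER  UPPER    MID  SPREAD")
for letter, depth, lower, upper, mid, spread in letter_values(zinc_stable):
    print(f"{letter:<4}{depth:>6}{lower:>7}{upper:>7}{mid:>7}{spread:>8}")
```

For the median row the lower and upper values coincide, so the spread prints as 0; the display on the slide simply shows the median once.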

Look at Skewness
     Bilirubin MID   Zinc-Stable MID   Zinc-Impaired MID
M         9.9             94                73
H         9.1             93.5              71
E        10.75            93.5              67.5
D        12.43            93.75             66.5
1        15.75            98.5              63
Bilirubin: mid-summaries increasing ===> skewed RIGHT
Zinc-Stable: not much of a trend ===> fairly symmetric
Zinc-Impaired: mid-summaries decreasing ===> slightly skewed LEFT

Choice of a Transformation
Ladder of Powers: X -> X^p
p       Transformation   Name              "Naturals"
...     ...
2       X^2              square
1       X                raw
1/2     sqrt(X)          square root       counts
(0)     log X            logarithm         biochemical measures
-1/2    -1/sqrt(X)       reciprocal root
-1      -1/X             reciprocal        waiting times (=> rates)
-2      -1/X^2
...     ...
Note: Use of a negative multiplier for p < 0 preserves natural order.
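The ladder can be expressed as one small function. This is a sketch: the name `reexpress` is hypothetical, the logarithm is taken as the natural log, and only positive data are handled.

```python
import math

def reexpress(x, p):
    """Re-express a positive value x with power p on the ladder of powers."""
    if x <= 0:
        raise ValueError("x must be positive for this sketch")
    if p == 0:
        return math.log(x)      # the log occupies the p = 0 rung
    if p > 0:
        return x ** p
    return -(x ** p)            # negative multiplier keeps the natural order

values = [0.5, 2.0, 9.0]        # illustrative values only
for p in (2, 1, 0.5, 0, -0.5, -1, -2):
    print(p, [round(reexpress(x, p), 3) for x in values])
```

For every rung the re-expressed values stay in the same order as the raw values, which is the point of the negative multiplier when p < 0.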

Effect of Transformation X -> X^p
p > 1: pulls in a stretched-out lower tail; stretches out a bunched-in upper tail
p < 1: pulls in a stretched-out upper tail; stretches out a bunched-in lower tail

Effect of Transformation: An Example (Bilirubin Data)
                    Mid-raw         Mid-sqrt                      Mid-log (ln)
M                    9.9             3.15                          2.29
H                    9.1             2.87                          2.00
E                   10.75            2.67                          1.26
D                   12.43            2.77                          1.07
1                   15.75            3.02                          0.92
                    Skewed right    About right? (symmetric?)     Skewed left
Ladder of powers    p = 1           p = 1/2                       p = 0
The log seems to stretch out the lower tail too much!!
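The mid-summaries in this table can be recomputed from the bilirubin letter values given earlier. This is a sketch under one assumption: each letter value is re-expressed directly, which is only approximate for letter values that were themselves averages of two observations. The helper name `mids` is illustrative.

```python
import math

# (lower, upper) letter values for bilirubin, from the letter-value display.
bilirubin = {
    "M": (9.9, 9.9),
    "H": (3.8, 14.4),
    "E": (0.6, 20.9),
    "D": (0.35, 24.5),
    "1": (0.2, 31.3),
}

def mids(letter_values, f):
    """Mid-summary of each pair of letter values after re-expression by f."""
    return {k: (f(lo) + f(hi)) / 2 for k, (lo, hi) in letter_values.items()}

mid_sqrt = mids(bilirubin, math.sqrt)   # p = 1/2
mid_log = mids(bilirubin, math.log)     # p = 0 (natural log)

for k in bilirubin:
    print(f"{k}: mid-sqrt = {mid_sqrt[k]:.2f}, mid-log = {mid_log[k]:.2f}")
```

Under the square root the mids stay within a narrow band (roughly 2.7 to 3.1), while under the log they fall steadily from 2.29 to 0.92, the pattern the slide reads as skewed left.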

Effect of Transformation for Bilirubin Data
(Panels: Raw data, Square root, log)

STARTS
Problem:
Can't take log x for $x \le 0$
Can't take even roots ($\sqrt{x}$, $\sqrt[4]{x}$, $\sqrt[6]{x}$, etc.) for $x < 0$
Some Solutions:
1. Use log (x + c) instead of log x (c is the "start"). c should be small compared to the typical size of data values, e.g. log (x + 1/4), log (x + 1/2), log (x + 1)
2. If all x's are negative, it is easier and better to simply multiply by -1 first, then take logs or even roots.
3. If only some x's are negative, then adding a constant might be OK.
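A minimal sketch of a started log follows. The function name `log_with_start`, the default start of 1/2, and both small data lists are assumptions made for illustration; none of them come from the slides.

```python
import math

def log_with_start(x, c=0.5):
    """Return log(x + c); c is the 'start' and must make x + c positive."""
    if x + c <= 0:
        raise ValueError("start c is too small for this x")
    return math.log(x + c)

counts = [0, 1, 3, 10, 40]      # hypothetical counts, including a zero
started = [log_with_start(x, c=0.5) for x in counts]
print([round(v, 2) for v in started])

# If every value is negative, flip the sign first and then take logs.
negatives = [-8.0, -2.0, -0.5]  # hypothetical values
flipped_logs = [math.log(-x) for x in negatives]
print([round(v, 2) for v in flipped_logs])
```

The zero count, which has no logarithm on its own, becomes log(0.5) with a start of 1/2, and the ordering of the counts is unchanged.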

Comparing To The Normal Distribution
After transforming a data set to a (roughly) symmetric shape, can the new distribution be compared to normality?
Yes: compare spreads to normal spreads.
Spreads for the N(0,1) distribution:
Name   Spread
H      1.349
E      2.301
D      3.068
(See Velleman & Hoaglin for more)
If the distribution is normal, then the quotients (H-Spread) / 1.349, (E-Spread) / 2.301, and (D-Spread) / 3.068 should be nearly equal.
If the quotients increase, then heavy tails. If the quotients decrease, then light tails.
Note: Can use (H-Spread) / 1.349 as an estimate of $\sigma$.
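The three normal spreads can be recovered from standard normal quantiles: the hinges, eighths, and sixteenths cut off tail areas of 1/4, 1/8, and 1/16. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `normal_spread` is an assumption.

```python
from statistics import NormalDist

def normal_spread(tail_area):
    """Distance between the upper and lower N(0,1) quantiles with this tail area."""
    z = NormalDist().inv_cdf(1 - tail_area)
    return 2 * z    # by symmetry the lower quantile is -z

h_spread = normal_spread(1 / 4)     # hinges
e_spread = normal_spread(1 / 8)     # eighths
d_spread = normal_spread(1 / 16)    # sixteenths

print(f"H = {h_spread:.3f}, E = {e_spread:.3f}, D = {d_spread:.3f}")
```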

Comparison to Normality: An Example
Since sqrt(Bilirubin) is quite symmetric, we'll look at that:
sqrt(Bilirubin)   Spread   s
M                 -
H                 1.85     1.37 (= 1.85 / 1.349)
E                 3.80     1.65 (= 3.80 / 2.301)
D                 4.36     1.42 (= 4.36 / 3.068)
Also, look at zinc-stable:
Zinc              Spread   s
M                 -
H                 15.0     11.1
E                 25.0     10.9
D                 28.5      9.3
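The quotients in this example can be checked with the sketch below. The function names are illustrative, and `tail_reading` is a deliberately crude assumption: it reports a trend only when the three quotients are strictly increasing or strictly decreasing.

```python
# Spreads of the standard normal distribution, as tabulated on the previous slide.
NORMAL_SPREAD = {"H": 1.349, "E": 2.301, "D": 3.068}

def standardized_spreads(spreads):
    """Divide each observed spread by the matching N(0,1) spread."""
    return {k: spreads[k] / NORMAL_SPREAD[k] for k in ("H", "E", "D")}

def tail_reading(quotients):
    """Crude reading of the quotients: increasing = heavy, decreasing = light."""
    h, e, d = (quotients[k] for k in ("H", "E", "D"))
    if h < e < d:
        return "heavy tails"
    if h > e > d:
        return "light tails"
    return "no clear trend"

sqrt_bilirubin = {"H": 1.85, "E": 3.80, "D": 4.36}
zinc_stable = {"H": 15.0, "E": 25.0, "D": 28.5}

for name, spreads in [("sqrt(Bilirubin)", sqrt_bilirubin), ("Zinc-Stable", zinc_stable)]:
    q = standardized_spreads(spreads)
    print(name, {k: round(v, 2) for k, v in q.items()}, "=>", tail_reading(q))
```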

Transformations Useful in Common Situations
A. AMOUNTS AND COUNTS: $\log x$, $x^{1/2}$, $x^{-1}$
   Example: white blood counts, glucose levels, number of patients seen in clinic per month.
   * log is especially useful if the ratio of the largest to smallest observation is large.
B. BALANCES (i.e., real numbers)
   Often not transformed, but if necessary do it!!
   Example: deviation from ideal body weight
C. COUNTED FRACTIONS, i.e., $p = x/n$ or $p = (x - A)/(B - A)$
   * use folded values, with transform(p) = f(p) - f(1-p) [symmetry is natural]
   pluralities: $p - (1-p) = 2p - 1$
   froots: $\sqrt{p} - \sqrt{1-p}$
   flogs: $\log p - \log(1-p) = \log\frac{p}{1-p} = \mathrm{logit}(p)$
   Example: proportion of patients responding to Rx; percentage of sperm with oval shape
D. RANKS (i.e., 1, 2, 3, ..., n): similar to fractions
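The three folded re-expressions for counted fractions can be written directly from the definition f(p) - f(1-p). This is a sketch: the function names follow the slide's labels, natural logs are assumed, and p must lie strictly between 0 and 1 for the flog.

```python
import math

def plurality(p):
    """Folded identity: p - (1 - p) = 2p - 1."""
    return p - (1 - p)

def froot(p):
    """Folded square root: sqrt(p) - sqrt(1 - p)."""
    return math.sqrt(p) - math.sqrt(1 - p)

def flog(p):
    """Folded log: log(p) - log(1 - p), i.e. the logit of p."""
    return math.log(p) - math.log(1 - p)

for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p = {p}: plurality = {plurality(p):+.2f}, "
          f"froot = {froot(p):+.3f}, flog = {flog(p):+.3f}")
```

Each folded value is 0 at p = 1/2 and changes sign when p is replaced by 1 - p, which is the symmetry the slide calls natural.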

Another Example
Duration of operation for 100 patients with Epidural Anesthesia (time recorded in minutes)
     DEPTH   LOWER   UPPER     MID   SPREAD
M    50.5       67.5          67.5
H    25.5    60      90       75      30
E    13      45     120       82.5    75
D     7      40     135       87.5    95
1     1      30     195      112.5   165
** Stretched-out upper tail suggests X^p with p < 1
Stem Leaf                     #
  19 5                        1
  18 00                       2
  17
  16 0                        1
  15 0                        1
  14 0                        1
  13 55555                    5
  12 000000                   6
  11
  10 5555                     4
   9 000000000000000         15
   8 0                        1
   7 5555555555555           13
   6 00000000000000000000000 23
   5
   4 055555555555555555555   21
   3 000000                   6
     ----+----+----+----+---
Multiply Stem.Leaf by 10**+1

Since log (p = 0) is slightly skewed right and -100 / OPTIME (p = -1) is skewed left, a power between 0 and -1 might work.
Try p = -1/2, i.e., -100 / sqrt(OPTIME)

p = -1/2: -100 / sqrt(OPTIME)
MID: M = -12.2, H = -11.7, E = -12.0, D = -12.2, 1 = -12.7
Stem Leaf                     #
  -7 2                        1
  -7 955                      3
  -8 2                        1
  -8 666665                   6
  -9 111111                   6
  -9 8888                     4
 -10
 -10 555555555555555         15
 -11 2                        1
 -11 5555555555555           13
 -12
 -12 99999999999999999999999 23
 -13
 -13
 -14
 -14 99999999999999999999    20
 -15
 -15 8                        1
 -16
 -16
 -17
 -17
 -18 333333                   6
     ----+----+----+----+---

p = 0: log (OPTIME)
MID: M = 4.2, H = 4.295, E = 4.3, D = 4.3, 1 = 4.335
Stem Leaf                     #
  52 7                        1
  51 99                       2
  50 18                       2
  49 111114                   6
  48
  47 999999                   6
  46 5555                     4
  45 000000000000000         15
  44
  43 22222222222228          14
  42
  41
  40 99999999999999999999999 23
  39
  38 11111111111111111111    20
  37
  36 9                        1
  35
  34 000000                   6
     ----+----+----+----+---
Multiply Stem.Leaf by 10**-1
Pretty good!!?

p = -1: -100 / OPTIME
MID: M = -1.48, H = -1.39, E = -1.53, D = -1.62, 1 = -1.92
Stem Leaf                     #
  -4 661                      3
  -6 44444172                 8
  -8 5555333333              10
 -10 111111111111111         15
 -12 33333333333335          14
 -14
 -16 77777777777777777777777 23
 -18
 -20
 -22 22222222222222222222    20
 -24 0                        1
 -26
 -28
 -30
 -32 333333                   6
     ----+----+----+----+---
Multiply Stem.Leaf by 10**-1
Now skewed to the low end

p = 1/2: sqrt(OPTIME)
MID: M = 8.22, H = 8.62, E = 8.83, D = 8.97, 1 = 9.72
Stem Leaf                     #
  14 0                        1
  13
  13 44                       2
  12 6                        1
  12 2                        1
  11 666668                   6
  11 000000                   6
  10
  10 2222                     4
   9 555555555555555         15
   9
   8 77777777777779          14
   8
   7 77777777777777777777777 23
   7
   6 77777777777777777777    20
   6 3                        1
   5 555555                   6
   5
     ----+----+----+----+---
Less skewness, but it still exists

Comparing OPTIME Spreads to Normal Distribution
Standardized Spread
     OPTIME   sqrt(OPTIME)   log(OPTIME)   -100/OPTIME   -100/sqrt(OPTIME)
H    22.2     1.29           .30           .64           1.78
E    32.6     1.84           .43           .60           2.52
D    30.9     1.73           .40           .57           2.35
about right?

Graphical Comparison of OPTIME Spreads to Normal Distribution
(Panels: OPTIME, sqrt(OPTIME), log(OPTIME), -100/sqrt(OPTIME), -100/OPTIME)

Example 4: Peak Common Bile Duct Pressure
During an operation, common bile duct pressure is measured every 2 minutes for 20 minutes. The ratio of pressure at time t to baseline (t = 0) is calculated. The peak ratio is recorded.
Peak Ratio
     MID     SPR
M    1.94    -
H    1.90    1.00
E    2.15    1.79
D    2.10    1.80
1    2.33    2.64
Stem Leaf         #
  36 5            1
  34 33           2
  32
  30 0004         4
  28 0            1
  26 0            1
  24 000034       6
  22 369          3
  20 0011489      7
  18 08           2
  16 0055789      7
  14 00034        5
  12 005555689    9
  10 19           2
     ----+----+----+----+
Multiply Stem.Leaf by 10**-1
-10 / sqrt(Peak Ratio)
     MID      SPR     STD SPR
M    -7.18    -       -
H    -7.45    2.00    1.48
E    -7.34    3.20    1.39
D    -7.45    3.36    1.10
1    -7.59    4.72    -
Stem Leaf         #
  -5 442          3
  -5 8887         4
  -6 4420         4
  -6 997765555    9
  -7 311110       6
  -7 99887775     8
  -8 43           2
  -8 9999988555  10
  -9 11           2
  -9 6            1
 -10 0            1
     ----+----+----+----+

A look ahead...
Variance stabilization
Straightening x-y plots
Interpretation and reporting