Martin L. Lesser, PhD Biostatistics Unit Feinstein Institute for Medical Research North Shore-LIJ Health System

1 PREP Course #10: Introduction to Exploratory Data Analysis and Data Transformations (Part 1) Martin L. Lesser, PhD Biostatistics Unit Feinstein Institute for Medical Research North Shore-LIJ Health System

2 CME Disclosure Statement
The North Shore-LIJ Health System adheres to the ACCME's new Standards for Commercial Support. Any individuals in a position to control the content of a CME activity, including faculty, planners, and managers, are required to disclose all financial relationships with commercial interests. All identified potential conflicts of interest are thoroughly vetted by North Shore-LIJ for fair balance and scientific objectivity and to ensure appropriateness of patient care recommendations.
Course Director and Course Planners Kevin Tracey, MD, Cynthia Hahn, Emmelyn Kim, MPH, and Tina Chuck, MPH have nothing to disclose. Martin L. Lesser, PhD, EMT-CC has nothing to disclose.

3 Quick Review
Measures of location: mean, median, quartiles, quantiles
Measures of spread: range, standard deviation, interquartile range, interquantile range
Quick displays of data: stem-and-leaf plot, box (and whisker) plot

4 Length of Stay (LOS) for 54 Pneumonia Patients (Hypothetical Example)
Observed data (days), n = 54
[Table: frequency distribution of LOS, beginning with the 3-4 days bin; columns: LOS, Frequency, Percent, Cumulative Frequency, Cumulative Percent]

5 SUMMARIZING DATA
Graphical methods: histograms, stem-and-leaf plots, box plots
Measures of location: mean, median, quartiles
Measures of spread: range (R), mean absolute deviation (MAD), variance ($s^2$), standard deviation (s or SD), interquartile range (IQR)

6 LOS for 54 Pneumonia Patients: Frequency Histogram
[Figure: frequency histogram of LOS]

7 LOS for 54 Pneumonia Patients: Relative Frequency Histogram
[Figure: relative frequency histogram of LOS]

8 Stem-and-Leaf Plot: LOS for 54 Pneumonia Patients
[Figure: stem-and-leaf plot with columns Stem, Leaf, and # (count), shown beside a boxplot]

9 Constructing a Stem-and-Leaf Plot
Steps 1 and 2: enter the 1st and 2nd data points, 8.0 and 8.0.
Step 3: enter the 3rd data point, 4.0.
Step 4: enter the 4th data point, 11.0; and so on.
Continue to fill in the plot until all data points have been plotted. Note that the data do not have to be entered in sorted order.
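The construction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: values are non-negative, the stem is the integer part, the leaf is the first decimal digit, and stems are printed from largest to smallest. The function names `stem_and_leaf` and `render` are hypothetical, and the data are just the four points named on the slide.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by integer stem; the leaf is the first decimal digit.

    Assumes non-negative values recorded to one decimal place.
    """
    plot = defaultdict(list)
    for v in values:                     # entered in the order given, unsorted
        tenths = round(v * 10)           # work in whole tenths to avoid float drift
        stem, leaf = divmod(tenths, 10)
        plot[stem].append(leaf)
    return dict(plot)

def render(plot):
    """Return one text line per stem, from the largest stem down to the smallest."""
    lines = []
    for stem in range(max(plot), min(plot) - 1, -1):
        leaves = "".join(str(d) for d in plot.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return "\n".join(lines)

data = [8.0, 8.0, 4.0, 11.0]             # the first four data points on the slide
plot = stem_and_leaf(data)
print(render(plot))
```

Empty stems are still printed, so gaps in the data remain visible in the display.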

10 How Many Stem Lines? What Interval Between Stems?
Maximum number of stem lines: $L = [10 \times \log_{10} n]$, where $[x]$ is the greatest integer function.
Example: n = 54, $L = [10 \times \log_{10} 54] = [17.3] = 17$
[Table: L for various values of n]
Interval size = range / L, rounded to the nearest power of 10.
Example: n = 54, L = 17, range = 34 - 2 = 32, so interval size = 32/17 ≈ 1.9, rounded to 1.
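A short Python sketch of the stem-line rule follows. It takes the greatest integer function strictly as the floor, so n = 54 gives 17 lines, and it assumes that "rounded to the nearest power of 10" means rounding on the log scale; that reading is an assumption, not something the slide spells out. The function names are hypothetical.

```python
import math

def max_stem_lines(n):
    """Maximum number of stem lines, L = [10 * log10(n)], with [x] taken as the floor."""
    return math.floor(10 * math.log10(n))

def interval_size(data_range, lines):
    """Range / L, rounded to the nearest power of 10 (assumed: nearest on the log scale)."""
    raw = data_range / lines
    return 10 ** round(math.log10(raw))

for n in (10, 54, 100):
    print(n, max_stem_lines(n))

# LOS example: range = 34 - 2 = 32 days
print(interval_size(32, max_stem_lines(54)))
```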

11 LOS for 54 Pneumonia Patients
[Figure: completed stem-and-leaf plot (Stem, Leaf, #) with boxplot]

12 Computing the Mean
Suppose there are n observations: $X_1, X_2, \ldots, X_n$
Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
FACTS: The mean measures the central tendency of the data. The mean is sensitive to extreme observations, known as outliers.
Observed data (days), n = 54: $\bar{X}$ = mean = 576 / 54 = 10.7 days

13 Computing the Median
The median is the middle value, which splits the data set into two equal parts.
To compute the median (M), arrange the $X_i$ in ascending order: $X_{(1)}, X_{(2)}, X_{(3)}, \ldots, X_{(n)}$, where $X_{(1)}$ = smallest value, $X_{(2)}$ = 2nd smallest value, ..., $X_{(n)}$ = largest value.
The median is defined as the middle observation, which corresponds to the ordered observation in position (n + 1) / 2 (its "depth").
If n is odd, the median falls exactly on the middle observation, $X_{((n+1)/2)}$.
If n is even, the median falls halfway between the two middle observations, $X_{(n/2)}$ and $X_{(n/2+1)}$. In other words, median = $(X_{(n/2)} + X_{(n/2+1)}) / 2$.
The median is said to be robust because it is not sensitive to outliers.

14 Computing the Median (continued)
Ordered data: n = 54
Since n is even, M is the average of the middle two numbers: depth of median = (n + 1) / 2 = 55 / 2 = 27.5, so M = average of obs #27 and #28 = 8 days.
If n is odd, M is simply the middle number, the observation at depth (n + 1) / 2.

15 Computing the Lower and Upper Quartiles ("Hinges")
The quartiles split the set of data into four equal parts.
Lower quartile $Q_1$ = median of lower half, at depth (n + 1) / 4
Upper quartile $Q_3$ = median of upper half, at depth 3(n + 1) / 4
Facts: Half of the observations lie between $Q_1$ and $Q_3$. The quartiles are said to be robust because they are not sensitive to outliers. There are several different methods for computing quartiles.
To compute the quartiles, refer to the ordered data:
Depth of $Q_1$ = (total obs + 1) / 4 = (54 + 1) / 4 = 13.75, so $Q_1$ = average of obs #13 and #14 = 6 days
Depth of $Q_3$ = 3 (total obs + 1) / 4 = 3 (54 + 1) / 4 = 41.25, so $Q_3$ = average of obs #41 and #42 = 12.5 days
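The depth rule for the median and quartiles can be sketched in Python as follows. The sketch follows the slide's worked example, in which a fractional depth such as 13.75 is resolved by averaging the two neighboring ordered observations (#13 and #14) rather than by interpolating; other quartile methods give slightly different answers. The function names and the ten-point data set are hypothetical.

```python
import math

def value_at_depth(ordered, depth):
    """Ordered observation at a 1-based depth; a fractional depth
    averages the two neighboring observations."""
    lo = math.floor(depth)
    hi = math.ceil(depth)
    return (ordered[lo - 1] + ordered[hi - 1]) / 2

def quartile_summary(data):
    """Q1, median, and Q3 at depths (n+1)/4, (n+1)/2, and 3(n+1)/4."""
    x = sorted(data)
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 observations")
    return {
        "Q1": value_at_depth(x, (n + 1) / 4),
        "M": value_at_depth(x, (n + 1) / 2),
        "Q3": value_at_depth(x, 3 * (n + 1) / 4),
    }

# Hypothetical data, n = 10: depths are 2.75, 5.5, and 8.25
print(quartile_summary([7, 3, 10, 1, 5, 9, 2, 8, 4, 6]))
```

For n = 54 the same depths come out as 13.75, 27.5, and 41.25, matching the slide.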

16 A measure of location alone does not adequately describe a set of data!!

17 Same Location, Different Spread

18 Computing Measures of Spread
Suppose there are n observations, $X_1, X_2, \ldots, X_n$
Range: $R = X_{\max} - X_{\min}$
Mean absolute deviation: $\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
Standard deviation: $\mathrm{SD} = s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$
Interquartile range: $\mathrm{IQR} = Q_3 - Q_1$
FACTS: The range, MAD, variance, SD, and IQR all measure the amount of variation (spread) in the data. All measures except the MAD and IQR are sensitive to extreme observations, known as outliers. MAD and IQR are robust measures of spread.
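As a rough illustration, the spread measures can be computed directly in Python. The sketch assumes the usual sample convention of dividing by n - 1 for the variance and by n for the MAD; the function name and the eight-point data set are hypothetical. Swapping one value for an extreme one shows how strongly each measure reacts.

```python
import math

def spread_measures(data):
    """Range, mean absolute deviation, sample variance (n - 1), and SD."""
    n = len(data)
    mean = sum(data) / n
    mad = sum(abs(x - mean) for x in data) / n
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return {
        "range": max(data) - min(data),
        "MAD": mad,
        "variance": variance,
        "SD": math.sqrt(variance),
    }

clean = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical data
with_outlier = [2, 4, 4, 4, 5, 5, 7, 90]  # same data, one extreme value

print(spread_measures(clean))
print(spread_measures(with_outlier))
```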

19 Summary for LOS Example
Observed data (days), n = 54
Location: $\bar{X}$ = 10.7 days, M = 8 days, $Q_1$ = 6 days, $Q_3$ = 12.5 days
Spread: R = 31 days, SD = 7.2 days, MAD = 5.1 days, IQR = 6.5 days

20 The Boxplot
The boxplot is a convenient way of depicting the distribution of data using measures of location and spread. The most important parts of a boxplot correspond to the lower and upper quartiles, the median, and the mean. It is sometimes known as a box-and-whisker plot.

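21 Anatomy of a Boxplot
From top to bottom:
Upper inner fence: $Q_3 + 1.5 \times \mathrm{IQR}$
$Q_3$
Median
$Q_1$
Lower inner fence: $Q_1 - 1.5 \times \mathrm{IQR}$
The mean is marked with a + symbol.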
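The inner fences are easy to compute once the quartiles are known. The Python sketch below uses the quartiles reported for the LOS example (Q1 = 6 days, Q3 = 12.5 days); the function names and the short list of test values are hypothetical.

```python
def inner_fences(q1, q3):
    """Lower and upper inner fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outside_values(data, q1, q3):
    """Observations falling beyond either inner fence."""
    lower, upper = inner_fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

lower, upper = inner_fences(6, 12.5)      # quartiles from the LOS example
print(lower, upper)

# Hypothetical LOS values (days) checked against those fences
print(outside_values([2, 8, 11, 22, 23, 34], 6, 12.5))
```

Because LOS cannot be negative, only the upper fence can flag observations in this example.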

22 Side-by-Side Boxplots: LOS for Four Nursing Stations
[Figure: schematic plots of LOS by nursing station (West, South, North, East)]

23 Salary Levels, by Gender

25 REFERENCES
Hoaglin DC, Mosteller F, Tukey JW. Understanding Robust and Exploratory Data Analysis. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc.
Chambers JM, Cleveland WS, Kleiner B, Tukey PA. Graphical Methods for Data Analysis. Wadsworth Statistics/Probability Series. Duxbury Press.
Mosteller F, Tukey JW. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods.
Velleman PF, Hoaglin DC. Applications, Basics, and Computing of Exploratory Data Analysis (A-B-Cs of EDA). PWS Publishers, Duxbury Press.

26 Introduction to Exploratory Data Analysis (EDA) Data Transformations Part 1

28 Why Transform* Data?
1. Classical Inference
   a. To achieve homoscedasticity (ANOVA and the t-test do not work with unequal variances)
   b. To achieve normality
   c. To straighten out plots
   d. To conform to known physical laws
2. Exploratory Data Analysis (EDA)
   a. To symmetrize/normalize
   b. To explore data
   c. To compare distributions
   d. To linearize plots
   e. To create confusion (??)
* EDAers use the word "re-express"

29 Displaying Data Using a Stem-and-Leaf Plot
LOS for 54 Pneumonia Patients (Hypothetical Example)
Observed data (days), n = 54

32 Displaying Data (The EDA Way)
1. Stem-and-Leaf Displays (Organize Data)
2. Letter-Value Displays (Summarize Data)
Example 1: Bilirubin of 95 Patients Who Underwent the Whipple Procedure, With Pathological Dx of Cancer
Bilirubin   Dx
14.5        Pancreas
13.1        Pancreas
8.1         Bile Duct
31.3        Ampulla
12.6        Pancreas
4.2         Other
22.2        Bile Duct

33 Whipple Procedure: Bilirubin of 95 Patients
[Figure: stem-and-leaf plot (Stem, Leaf, #)]

34 Example 2: Zinc Levels in Patients With Epidermoid Cancer of the Head and Neck
Patients with stable nutritional status (n = 25)
[Figure: stem-and-leaf plot (Stem, Leaf, #); multiply Stem.Leaf by 10**+1]
Patients with impaired nutritional status (n = 25)
[Figure: stem-and-leaf plot (Stem, Leaf, #); multiply Stem.Leaf by 10**+1]

35 Letter-Value Displays
Depths ([x] is the greatest integer function):
Median (M): d(M) = (n + 1) / 2
Hinges (H): d(H) = ([d(M)] + 1) / 2
Eighths (E): d(E) = ([d(H)] + 1) / 2
Sixteenths (D): d(D) = ([d(E)] + 1) / 2
Extremes (1): d(1) = 1
Mid-summaries:
med: median
mid H: mid-hinge = $(H_L + H_U) / 2$
mid E: mid-eighth = $(E_L + E_U) / 2$
mid D: mid-sixteenth = $(D_L + D_U) / 2$
mid 1: mid-range = (min + max) / 2
Spreads:
H-spread: interquartile range = $H_U - H_L$
E-spread = $E_U - E_L$
D-spread = $D_U - D_L$
1-spread: range = max - min
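The depth recursion and the resulting display can be sketched in Python. This is a minimal illustration, assuming that a half-integer depth is resolved by averaging the two neighboring ordered observations, counting up from the minimum for the lower value and down from the maximum for the upper value. The function names and the nine-point data set are hypothetical.

```python
import math

def letter_value_depths(n, letters="MHED"):
    """Depths d(M) = (n+1)/2, then d(next) = ([d(previous)] + 1) / 2; extremes have depth 1."""
    depths = {}
    d = (n + 1) / 2
    for letter in letters:
        depths[letter] = d
        d = (math.floor(d) + 1) / 2
    depths["1"] = 1.0
    return depths

def letter_value_display(data):
    """Depth, lower and upper letter values, mid-summary, and spread for M, H, E, D, 1."""
    x = sorted(data)
    n = len(x)
    rows = {}
    for letter, d in letter_value_depths(n).items():
        lo, hi = math.floor(d), math.ceil(d)
        lower = (x[lo - 1] + x[hi - 1]) / 2   # count up from the minimum
        upper = (x[n - lo] + x[n - hi]) / 2   # count down from the maximum
        rows[letter] = {
            "depth": d,
            "lower": lower,
            "upper": upper,
            "mid": (lower + upper) / 2,
            "spread": upper - lower,
        }
    return rows

print(letter_value_depths(95))   # bilirubin example, n = 95
print(letter_value_depths(25))   # zinc examples, n = 25

for letter, row in letter_value_display(range(1, 10)).items():
    print(letter, row)
```

For perfectly symmetric data such as 1 through 9, every mid-summary equals the median, which is the pattern the mid column is meant to reveal.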

36 Letter-Value Displays for the Examples
[Tables: letter-value displays for Bilirubin (n = 95), Zinc-Stable (n = 25), and Zinc-Impaired (n = 25); rows M, H, E, D; columns DEPTH, LOWER, UPPER, MID, SPREAD]

37 Look at Skewness
MID     Bilirubin   Zinc-Stable   Zinc-Impaired
M       9.9         94            73
H       9.1         93.5          71
E       10.75       93.5          67.5
D       12.43       93.75
1       15.75       98.
Bilirubin: mid-summaries increasing, so skewed RIGHT.
Zinc-Stable: not much of a trend; fairly symmetric.
Zinc-Impaired: mid-summaries decreasing, so slightly skewed LEFT.

38 Choice of a Transformation
Ladder of Powers: $X \to X^p$
p       Transformation    Name              Naturals
2       $X^2$             square
1       $X$               raw
1/2     $\sqrt{X}$        square root       counts
(0)     $\log X$          logarithm         biochemical measures
-1/2    $-1/\sqrt{X}$     reciprocal root
-1      $-1/X$            reciprocal        waiting times (=> rates)
-2      $-1/X^2$
Note: Use of a negative multiplier for p < 0 preserves the natural order.
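One way to sketch the ladder of powers in Python is shown below. It assumes positive data, takes the natural log at p = 0, and applies the negative multiplier for p < 0 so that larger raw values remain larger after re-expression. The function name `reexpress` and the sample values are hypothetical.

```python
import math

def reexpress(x, p):
    """Ladder-of-powers re-expression of a positive value x.

    p = 0 is the (natural) log; for p < 0 the result is negated
    so that the original ordering is preserved.
    """
    if x <= 0:
        raise ValueError("x must be positive; consider a start first")
    if p == 0:
        return math.log(x)
    if p < 0:
        return -(x ** p)
    return x ** p

values = [0.5, 1.0, 4.0, 9.0, 25.0]       # hypothetical positive data, already sorted
for p in (2, 1, 0.5, 0, -0.5, -1, -2):
    transformed = [reexpress(v, p) for v in values]
    print(p, [round(t, 3) for t in transformed])
```

Every printed row is increasing from left to right, which is the order-preserving property the note on the slide refers to.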

39 Effect of Transformation $X \to X^p$
p > 1: pulls in a stretched-out lower tail; stretches out a bunched-in upper tail.
p < 1: pulls in a stretched-out upper tail; stretches out a bunched-in lower tail.

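40 Effect of Transformation: An Example (Bilirubin Data)
[Table: mid-summaries (rows M, H, E, D) under three re-expressions; columns Mid-raw, Mid-square root, Mid-log (ln)]
Ladder of powers: p = 1 (raw), p = 1/2 (square root), p = 0 (log)
Raw: skewed right.
Square root: about right (symmetric?).
Log: skewed left; seems to stretch out the lower tail too much!!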

41 Effect of Transformation for Bilirubin Data
[Figure: three panels: raw data, square root, log]

42 STARTS
Problem:
Can't take log x for x ≤ 0.
Can't take even roots ($\sqrt{x}$, $\sqrt[4]{x}$, $\sqrt[6]{x}$, etc.) for x < 0.
Some Solutions:
1. Use log (x + c) instead of log x (c is the "start"). c should be small compared to the typical size of the data values, e.g. log (x + 1/4), log (x + 1/2), log (x + 1).
2. If all x's are negative, it is easier and better to simply multiply by -1 first, then take logs or even roots.
3. If only some x's are negative, then adding a constant might be OK.
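A started log can be sketched in Python as follows. The default start of 1/2 is only one of the values suggested on the slide, and the function name and the sample counts are hypothetical.

```python
import math

def started_log(values, start=0.5):
    """Natural log of (x + start); the start c lets zeros be re-expressed."""
    if min(values) + start <= 0:
        raise ValueError("start is too small: x + start must be positive")
    return [math.log(v + start) for v in values]

counts = [0, 1, 3, 7]                      # hypothetical counts, including a zero
print(started_log(counts, start=1))

# If every value is negative, flip the sign first and then take logs
negatives = [-8.0, -2.0, -1.0]
print([math.log(-v) for v in negatives])
```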

43 Comparing to the Normal Distribution
After transforming a data set to a (roughly) symmetric shape, can the new distribution be compared to normality?
Yes: compare spreads to normal spreads.
Spreads for the N(0,1) distribution:
Name    Spread
H       1.349
E       2.301
D       3.068
(See Velleman & Hoaglin for more)
If the distribution is normal, then the quotients (H-spread) / 1.349, (E-spread) / 2.301, and (D-spread) / 3.068 should be nearly equal.
If the quotients increase, then heavy tails. If the quotients decrease, then light tails.
Note: Can use (H-spread) / 1.349 as an estimate of $\sigma$.

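44 Comparison to Normality: An Example
Since $\sqrt{\text{Bilirubin}}$ is quite symmetric, we'll look at that.
        Spread    s = spread / normal spread
M       -         -
H       1.85      1.85 / 1.349 ≈ 1.37
E       3.80      3.80 / 2.301 ≈ 1.65
D       4.36      4.36 / 3.068 ≈ 1.42
Also, look at zinc-stable:
[Table: spreads and standardized spreads s for Zinc-Stable; rows M, H, E, D]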
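The quotients for the square-root bilirubin example can be checked with a few lines of Python. The divisors 1.349, 2.301, and 3.068 are the N(0,1) spreads used on the slide; the sketch also recomputes them from the standard normal quantiles with `statistics.NormalDist` (Python 3.8 or later). The function names are hypothetical.

```python
from statistics import NormalDist

# N(0,1) spreads as used on the slide
NORMAL_SPREADS = {"H": 1.349, "E": 2.301, "D": 3.068}

def normal_spread(tail_fraction):
    """Distance between the upper and lower letter values of N(0,1),
    where tail_fraction is 1/4 for H, 1/8 for E, and 1/16 for D."""
    return 2 * NormalDist().inv_cdf(1 - tail_fraction)

def standardized_spreads(spreads):
    """Each observed spread divided by the matching N(0,1) spread."""
    return {k: spreads[k] / NORMAL_SPREADS[k] for k in ("H", "E", "D")}

sqrt_bilirubin = {"H": 1.85, "E": 3.80, "D": 4.36}   # spreads from the slide
quotients = standardized_spreads(sqrt_bilirubin)
for letter, q in quotients.items():
    print(letter, round(q, 2))

print([round(normal_spread(f), 3) for f in (1 / 4, 1 / 8, 1 / 16)])
```

Nearly equal quotients suggest roughly normal tails, and the H quotient doubles as a robust estimate of the standard deviation.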

45 Transformations Useful in Common Situations
A. AMOUNTS AND COUNTS: $\log x$, $x^{1/2}$, $x^{-1}$
Example: white blood counts, glucose levels, number of patients seen in clinic per month.
* log is especially useful if the ratio of the largest to smallest observation is large.
B. BALANCES (i.e., real numbers)
Often not transformed, but if necessary do it!!
Example: deviation from ideal body weight
C. COUNTED FRACTIONS, i.e., $p = \frac{x}{n}$ or $p = \frac{x - A}{B - A}$
* Use folded values with transform(p) = f(p) - f(1 - p) [symmetry is natural]
froots: $\sqrt{p} - \sqrt{1 - p}$
flogs: $\log p - \log(1 - p) = \log\frac{p}{1 - p} = \mathrm{logit}(p)$
pluralities: $p - (1 - p) = 2p - 1$
Example: proportion of patients responding to treatment (rx); percentage of sperm with oval shape
D. RANKS (i.e., 1, 2, 3, ..., n): similar to fractions
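The folded transformations for counted fractions can be sketched in Python. Each has the form f(p) - f(1 - p), so p and 1 - p map to values of equal size and opposite sign. The function names follow the slide's terms, the natural log is assumed for flogs, and the sample fractions are hypothetical.

```python
import math

def counted_fraction(x, low, high):
    """Rescale x on the interval [low, high] to a fraction p = (x - A) / (B - A)."""
    return (x - low) / (high - low)

def froot(p):
    """Folded root: sqrt(p) - sqrt(1 - p)."""
    return math.sqrt(p) - math.sqrt(1 - p)

def flog(p):
    """Folded log: log(p) - log(1 - p), i.e. logit(p); requires 0 < p < 1."""
    return math.log(p) - math.log(1 - p)

def plurality(p):
    """Folded raw fraction: p - (1 - p) = 2p - 1."""
    return p - (1 - p)

for p in (0.1, 0.25, 0.5, 0.75, 0.9):      # hypothetical response proportions
    print(p, round(plurality(p), 3), round(froot(p), 3), round(flog(p), 3))
```

Near p = 0.5 the three scales behave alike; toward 0 or 1 the flog stretches the tails most.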

46 Another Example
Duration of operation for 100 patients with epidural anesthesia (time recorded in minutes)
[Table: letter-value display; rows M, H, E, D; columns DEPTH, LOWER, UPPER, MID, SPREAD]
[Figure: stem-and-leaf plot (Stem, Leaf, #); multiply Stem.Leaf by 10**+1]
** Stretched-out upper tail suggests $X^p$ with p < 1

47 Since log (p = 0) is slightly skewed right and $-100/\mathrm{OPTIME}$ (p = -1) is skewed left, a power between 0 and -1 might work.
Try p = -1/2, i.e., $-100/\sqrt{\mathrm{OPTIME}}$
[Figure: stem-and-leaf plot (Stem, Leaf, #); multiply Stem.Leaf by 10**+1]

48 p = -1/2: $-100/\sqrt{\mathrm{OPTIME}}$
[Table: mid-summaries (MID) for M, H, E, D, 1]
[Figure: stem-and-leaf plot (Stem, Leaf, #)]

49 p = 0: log (OPTIME)
        MID
M       4.2
H
E       4.3
D
1
[Figure: stem-and-leaf plot (Stem, Leaf, #); multiply Stem.Leaf by 10**-1]
Pretty good!!?

50 p = -1: $-100/\mathrm{OPTIME}$
[Table: mid-summaries (MID) for M, H, E, D, 1]
[Figure: stem-and-leaf plot (Stem, Leaf, #); multiply Stem.Leaf by 10**-1]
Now skewed to the low end.

51 p = 1/2: $\sqrt{\mathrm{OPTIME}}$
        MID
M       8.22
H       8.62
E       8.83
D
1       9.72
[Figure: stem-and-leaf plot (Stem, Leaf, #)]
Less skewness, but it still exists.

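52 Comparing OPTIME Spreads to Normal Distribution
[Table: standardized spreads (rows H, E, D) for OPTIME and its re-expressions: $\sqrt{\mathrm{OPTIME}}$, log (OPTIME), $-100/\sqrt{\mathrm{OPTIME}}$, $-100/\mathrm{OPTIME}$; one column is annotated "about right?"]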

53 Graphical Comparison of OPTIME Spreads to Normal Distribution
[Figure: panels for OPTIME and its re-expressions, including log (OPTIME), $-100/\sqrt{\mathrm{OPTIME}}$, and $-100/\mathrm{OPTIME}$]

54 Example 4: Peak Common Bile Duct Pressure
During an operation, common bile duct pressure is measured every 2 minutes for 20 minutes. The ratio of pressure at time t to baseline (t = 0) is calculated. The peak ratio is recorded.
[Two displays of Peak Ratio, each with a stem-and-leaf plot (Stem, Leaf, #; the first with multiply Stem.Leaf by 10**-1) and a table with rows M, H, E, D and columns MID, SPR, STD SPR]

55 A look ahead...
Variance stabilization
Straightening x-y plots
Interpretation and reporting
