Martin L. Lesser, PhD Biostatistics Unit Feinstein Institute for Medical Research North Shore-LIJ Health System


PREP Course #10: Introduction to Exploratory Data Analysis and Data Transformations (Part 1) Martin L. Lesser, PhD Biostatistics Unit Feinstein Institute for Medical Research North Shore-LIJ Health System

CME Disclosure Statement

The North Shore-LIJ Health System adheres to the ACCME's new Standards for Commercial Support. Any individuals in a position to control the content of a CME activity, including faculty, planners, and managers, are required to disclose all financial relationships with commercial interests. All identified potential conflicts of interest are thoroughly vetted by North Shore-LIJ for fair balance and scientific objectivity and to ensure appropriateness of patient care recommendations.

The Course Director and Course Planners (Kevin Tracey, MD; Cynthia Hahn; Emmelyn Kim, MPH; Tina Chuck, MPH) have nothing to disclose. Martin L. Lesser, PhD, EMT-CC, has nothing to disclose.

Quick Review

Measures of location: mean, median, quartiles, quantiles
Measures of spread: range, standard deviation, interquartile range, interquantile range
Quick displays of data: stem-and-leaf plot, box (and whisker) plot

LOS for 54 Pneumonia Patients (Hypothetical Example)

Observed data (days), n = 54:
8 8 4 11 6 8 5 14 10 16 4 5 12 8 3 7 14 9 6 6 5 8 34 6 10 22 9 9 7 5 8 8 10 4 8 10 17 5 13 15 4 12 12 10 6 3 3 8 9 8 26 28 30 30

Frequency Distribution

LOS           Frequency   Percent   Cumulative Frequency   Cumulative Percent
 3-4 days          6       11.11              6                  11.11
 5-6 days         10       18.52             16                  29.63
 7-8 days         12       22.22             28                  51.85
 9-10 days         9       16.67             37                  68.52
11-12 days         4        7.41             41                  75.93
13-14 days         4        7.41             45                  83.33
15-16 days         2        3.70             47                  87.04
17-18 days         1        1.85             48                  88.89
21-22 days         1        1.85             49                  90.74
25-26 days         1        1.85             50                  92.59
27-28 days         1        1.85             51                  94.44
29-30 days         2        3.70             53                  98.15
33-34 days         1        1.85             54                 100.00
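As a rough illustration, the frequency distribution above can be tallied with a few lines of Python. This is only a sketch: the values are copied from the ordered listing of the same 54 observations given with the median computation, and the function name `frequency_table` and its `width` and `start` parameters are assumptions made for this example.

```python
from collections import Counter

# LOS values (days), copied from the ordered listing in these slides, n = 54.
LOS = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]


def frequency_table(values, width=2, start=3):
    """Return (label, freq, pct, cum_freq, cum_pct) rows for non-empty bins."""
    n = len(values)
    # Bin index 0 covers start..start+width-1, index 1 the next interval, etc.
    counts = Counter((v - start) // width for v in values)
    rows, cum = [], 0
    for k in sorted(counts):
        lo = start + k * width
        cum += counts[k]
        rows.append((f"{lo}-{lo + width - 1} days", counts[k],
                     round(100 * counts[k] / n, 2),
                     cum, round(100 * cum / n, 2)))
    return rows


for row in frequency_table(LOS):
    print("{:<12}{:>4}{:>8.2f}{:>4}{:>8.2f}".format(*row))
```

Empty intervals (for example 19-20 days) are simply omitted, as in the table on the slide.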

SUMMARIZING DATA

Graphical Methods
  Histograms
  Stem-and-leaf plots
  Box plots

Measures of Location
  Mean
  Median
  Quartiles

Measures of Spread
  Range (R)
  Mean absolute deviation (MAD)
  Variance (s^2)
  Standard deviation (s or SD)
  Interquartile range (IQR)

Figure: LOS for 54 Pneumonia Patients, Frequency Histogram

Figure: LOS for 54 Pneumonia Patients, Relative Frequency Histogram

Stem-and-Leaf Plot: LOS for 54 Pneumonia Patients

Stem  Leaf              #   Boxplot
  34  0                 1      *
  32
  30  00                2      0
  28  0                 1      0
  26  0                 1      0
  24
  22  0                 1      0
  20
  18
  16  00                2
  14  000               3
  12  00000             5   +-----+
  10  000000            6      +
   8  00000000000000   14   *-----*
   6  0000000           7   +-----+
   4  000000000         9
   2  00                2
      ----+----+----+----+

Constructing a Stem-and-Leaf Plot

Stem  Leaf
  34
  32
  30
  28
  26
  24
  22
  20
  18
  16
  14
  12
  10  0    <=== Step 4: represents 4th data point, 11.0; and so on
   8  00   <=== Steps 1 and 2: represent 1st and 2nd data points, 8.0 and 8.0
   6
   4  0    <=== Step 3: represents 3rd data point, 4.0
   2

Continue to fill in the plot until all data points have been plotted. Note that the data do not have to be entered in sorted order.
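The construction steps can be sketched in Python as follows. This is a minimal illustration, not the software that produced the plot on the slides: the function name `stem_and_leaf` is hypothetical, stems are spaced 2 days apart as in the display, and each observation is tallied as a "0" leaf to mirror that display. The values are taken from the ordered listing of the 54 observations; any input order gives the same result.

```python
# LOS values (days), copied from the ordered listing in these slides, n = 54.
LOS = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]


def stem_and_leaf(values, interval=2):
    """Return (stem, leaves) rows from the highest stem down to the lowest."""
    leaves = {}
    for v in values:                       # input order does not matter
        stem = (v // interval) * interval  # e.g. 8 and 9 both fall on stem 8
        leaves[stem] = leaves.get(stem, "") + "0"
    top, bottom = max(leaves), min(leaves)
    # Include empty stems so gaps in the data stay visible.
    return [(s, leaves.get(s, "")) for s in range(top, bottom - 1, -interval)]


for stem, leaf in stem_and_leaf(LOS):
    print(f"{stem:>4}  {leaf:<16}{len(leaf) if leaf else ''}")
```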

How Many Stem Lines? What Interval Between Stems?

Maximum number of stem lines: L = [10 x log10 n], where [x] is the greatest integer function.
Example: n = 54, L = [10 x log10 54] = [17.3] = 17

L for various values of n:
n    20   50   100   150   200   300
L    13   16    20    21    23    24

Interval size = range / L, rounded to the nearest power of 10.
Example: n = 54, L = 17, range = 34 - 2 = 32, interval size = 32/17 = 1.9, rounded to 1


Computing the Mean

Suppose there are n observations: $X_1, X_2, \ldots, X_n$.

Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

FACTS:
The mean measures the central tendency of the data.
The mean is sensitive to extreme observations, known as outliers.

For the LOS data (n = 54): $\bar{X}$ = mean = 576 / 54 = 10.7 days

Computing the Median

The median is the middle value that splits the data set into two equal parts.

To compute the median (M), arrange the $X_i$ in ascending order: $X_{(1)}, X_{(2)}, X_{(3)}, \ldots, X_{(n)}$, where $X_{(1)}$ = smallest value, $X_{(2)}$ = 2nd smallest value, ..., $X_{(n)}$ = largest value.

The median is defined as the middle observation, which corresponds to the ordered observation in position (n + 1) / 2 (the "depth").

Note that if n is an odd number, then the median falls precisely on the middle observation, $X_{((n+1)/2)}$.

If n is an even number, then the median falls halfway between the two middle observations, $X_{(n/2)}$ and $X_{(n/2+1)}$. In other words, median = $(X_{(n/2)} + X_{(n/2+1)}) / 2$.

The median is said to be robust because it is not sensitive to outliers.

Computing the Median (continued)

Ordered data (n = 54):
3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 10 10 10 10 10 11 12 12 12 13 13 14 14 15 16 17 22 26 28 30 30 34

Since n is even, M is the average of the middle two numbers:
depth of median = (n + 1) / 2 = 55 / 2 = 27.5 => M = average of obs # 27 and # 28 = 8 days

If n is odd, then M is simply the middle number, i.e. the observation at depth (n + 1) / 2.

Computing the Lower and Upper Quartiles ("Hinges")

The quartiles split the set of data into four equal parts.
Lower quartile Q1 = median of lower half, at depth (n + 1) / 4
Upper quartile Q3 = median of upper half, at depth 3 * (n + 1) / 4

Facts:
The quartiles split the sample into quarters.
Half of the observations lie between Q1 and Q3.
The quartiles are said to be robust because they are not sensitive to outliers.
There are several different methods for computing quartiles.

To compute the quartiles, refer to the ordered data:
Q1 = lower quartile: depth = (total obs + 1) / 4 = (54 + 1) / 4 = 13.75 => average of obs # 13 and # 14 = 6 days
Q3 = upper quartile: depth = 3 * (total obs + 1) / 4 = 3 * (54 + 1) / 4 = 41.25 => average of obs # 41 and # 42 = 12.5 days
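The depth rule used for the median and the quartiles can be sketched in Python. The helper name `value_at_depth` is an assumption for this illustration; it follows the convention used on these slides of averaging the two neighbouring observations whenever the depth is fractional. Other software may interpolate differently, so its quartiles can differ slightly.

```python
import math

# Ordered LOS data (days) from the slides, n = 54.
LOS_SORTED = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
              8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
              11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]


def value_at_depth(sorted_values, depth):
    """Observation at a 1-based depth; a fractional depth averages the
    two neighbouring observations (the rule used on the slides)."""
    lo = math.floor(depth)
    hi = math.ceil(depth)
    return (sorted_values[lo - 1] + sorted_values[hi - 1]) / 2


n = len(LOS_SORTED)
median = value_at_depth(LOS_SORTED, (n + 1) / 2)   # depth 27.5
q1 = value_at_depth(LOS_SORTED, (n + 1) / 4)       # depth 13.75
q3 = value_at_depth(LOS_SORTED, 3 * (n + 1) / 4)   # depth 41.25

print(median, q1, q3)
```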

A measure of location, alone, does not adequately describe a set of data!!

Figure: Same Location, Different Spread

Computing Measures of Spread

Suppose there are n observations, $X_1, X_2, \ldots, X_n$.

Range: $R = X_{\max} - X_{\min}$
Mean absolute deviation: $\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
Standard deviation: $\mathrm{SD} = s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$
Interquartile range: $\mathrm{IQR} = Q_3 - Q_1$

FACTS:
The range, MAD, variance, SD and IQR all measure the amount of variation (spread) in the data.
All measures except the MAD and IQR are sensitive to extreme observations, known as outliers.
MAD and IQR are robust measures of spread.

Summary for LOS Example (n = 54)

Location
  $\bar{X}$ = 10.7 days
  M = 8 days
  Q1 = 6 days
  Q3 = 12.5 days

Spread
  R = 31 days
  SD = 7.2 days
  MAD = 5.1 days
  IQR = 6.5 days
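The summary values can be checked with Python's standard library. This sketch assumes the n - 1 divisor for the variance and SD, which is the choice that reproduces the SD of 7.2 days reported above, and divides by n for the MAD. The quartiles are taken as given on the slides rather than recomputed, and the data are copied from the ordered listing.

```python
import statistics

# LOS values (days), copied from the ordered listing in these slides, n = 54.
LOS = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
       11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]

n = len(LOS)
mean = statistics.mean(LOS)                      # 576 / 54
data_range = max(LOS) - min(LOS)                 # R
mad = sum(abs(x - mean) for x in LOS) / n        # mean absolute deviation
variance = statistics.variance(LOS)              # divisor n - 1 (assumed)
sd = statistics.stdev(LOS)                       # square root of the above
q1, q3 = 6.0, 12.5                               # quartiles from the slides
iqr = q3 - q1

print(f"mean={mean:.1f}  R={data_range}  MAD={mad:.1f}  "
      f"SD={sd:.1f}  IQR={iqr}")
```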

The Boxplot

The boxplot is a convenient way of depicting the distribution of data using measures of location and spread.
The most important parts of a boxplot correspond to the lower and upper quartiles, the median, and the mean.
Sometimes known as a box-and-whisker plot.

Anatomy of a Boxplot (from top to bottom)

Inner fence: Q3 + 1.5 x IQR
Q3
Median
Q1
Inner fence: Q1 - 1.5 x IQR

+ marks the mean
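As an illustration, the inner fences for the LOS example can be computed directly from the quartiles given on the earlier slides (Q1 = 6, Q3 = 12.5). The variable names are assumptions for this sketch. Because packages use different quartile rules, a software-drawn boxplot can flag slightly different points.

```python
# Ordered LOS data (days) from the slides, n = 54.
LOS_SORTED = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7,
              8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10,
              11, 12, 12, 12, 13, 13, 14, 14, 15, 16, 17, 22, 26, 28, 30, 30, 34]

q1, q3 = 6.0, 12.5            # quartiles as computed on the slides
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # inner fence below the box
upper_fence = q3 + 1.5 * iqr  # inner fence above the box

# Observations beyond the inner fences are candidates for outliers.
outside = [x for x in LOS_SORTED if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outside)
```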

Figure: Schematic Plots. Side-by-Side Boxplots of LOS for Four Nursing Stations (vertical axis: LOS, 0 to 25; horizontal axis: Nursing Station: 1 West, 2 South, 3 North, 4 East)

Figure: Salary Levels, by Gender

REFERENCES

Hoaglin DC, Mosteller F, Tukey JW. Understanding Robust and Exploratory Data Analysis. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., 1983.
Chambers JM, Cleveland WS, Kleiner B, Tukey PA. Graphical Methods for Data Analysis. Wadsworth Statistics/Probability Series. Duxbury Press, 1983.
Mosteller F, Tukey JW. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods, 1977.
Velleman PF, Hoaglin DC. Applications, Basics, and Computing of Exploratory Data Analysis (A-B-Cs of EDA). PWS Publishers, Duxbury Press, 1981.

Introduction to Exploratory Data Analysis (EDA) Data Transformations Part 1


Why Transform* Data?

1. Classical Inference
   a. To achieve homoscedasticity (ANOVA and the t-test do not work with unequal variances)
   b. To achieve normality
   c. To straighten out plots
   d. To conform to known physical laws
2. Exploratory Data Analysis (EDA)
   a. To symmetrize/normalize
   b. To explore data
   c. To compare distributions
   d. To linearize plots
   e. To create confusion (??)

* EDAers use the word "re-express"

Displaying Data Using a Stem-and-Leaf Plot

LOS for 54 Pneumonia Patients (Hypothetical Example)

Observed data (days), n = 54:
8 8 4 11 6 8 5 14 10 16 4 5 12 8 3 7 14 9 6 6 5 8 34 6 10 22 9 9 7 5 8 8 10 4 8 10 17 5 13 15 4 12 12 10 6 3 3 8 9 8 26 28 30 30


Displaying Data (The EDA Way)

1. Stem-and-Leaf Displays (Organize Data)
2. Letter-Value Displays (Summarize Data)

Example 1: Bilirubin of 95 Patients Who Underwent the Whipple Procedure, 1940-1980, With Pathological Dx of Cancer

14.5  Pancreas
13.1  Pancreas
 8.1  Bile Duct
31.3  Ampulla
12.6  Pancreas
 4.2  Other
22.2  Bile Duct
...   ...

Whipple Procedure: Bilirubin of 95 Patients

Stem  Leaf               #
  31  3                  1
  30
  29
  28  2                  1
  27  7                  1
  26  03                 2
  25
  24  012                3
  23
  22  223                3
  21  0                  1
  20  0028               4
  19
  18  02                 2
  17  8                  1
  16  168                3
  15  0                  1
  14  356                3
  13  001446             6
  12  126679             6
  11  122245             6
  10  789                3
   9  001689             6
   8  1                  1
   7  334                3
   6  0145788            7
   5  017                3
   4  2559               4
   3  134                3
   2  0389               4
   1  02                 2
   0  233333444566688   15
      ----+----+----+----+

Example 2: Zinc levels in patients with Epidermoid Cancer of the head and neck

Patients with stable nutritional status (n = 25)

Stem  Leaf        #
  11  9           1
  10  145679      6
   9  01455555    8
   8  01236889    8
   7  89          2
      ----+----+----+----+
Multiply Stem.Leaf by 10**+1

Patients with impaired nutritional status (n = 25)

Stem  Leaf              #
   8  01                2
   7  22233444477889   14
   6  15568             5
   5  167               3
   4  5                 1
      ----+----+----+----+
Multiply Stem.Leaf by 10**+1

Letter-Value Displays

Depths ([d] denotes the integer part of d)
Extremes (1)     d(1) = 1
Sixteenths (D)   d(D) = ([d(E)] + 1) / 2
Eighths (E)      d(E) = ([d(H)] + 1) / 2
Hinges (H)       d(H) = ([d(M)] + 1) / 2
Median (M)       d(M) = (n + 1) / 2

Mid-Summaries
mid 1   Mid-range = (min + max) / 2
mid D   Mid-sixteenth = (D_L + D_U) / 2
mid E   Mid-eighth = (E_L + E_U) / 2
mid H   Mid-hinge = (H_L + H_U) / 2
med     Median

Spreads
1 spread   range = max - min
D spread   = D_U - D_L
E spread   = E_U - E_L
H spread   Interquartile range = H_U - H_L

Letter-Value Displays for the Examples

Bilirubin (n=95)
     DEPTH   LOWER   UPPER    MID    SPREAD
M    48          9.9          9.9
H    24.5     3.8    14.4     9.1    10.6
E    12.5     0.6    20.9    10.75   20.3
D     6.5     0.35   24.5    12.43   24.15
1     1       0.2    31.3    15.75   31.1

Zinc-Stable (n=25)
     DEPTH   LOWER   UPPER    MID    SPREAD
M    13          94           94
H     7      86      101      93.5   15
E     4      81      106      93.5   25
D     2.5    79.5    108      93.75  28.5
1     1      78      119      98.5   41

Zinc-Impaired (n=25)
     DEPTH   LOWER   UPPER    MID    SPREAD
M    13          73           73
H     7      65       77      71     12
E     4      57       78      67.5   21
D     2.5    53.5     79.5    66.5   26
1     1      45       81      63     36
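The depth recursion and the resulting letter values can be sketched in Python. The function names `value_at_depth` and `letter_values` are assumptions for this illustration; the data are the 25 zinc-stable values read off the stem-and-leaf display (Stem.Leaf multiplied by 10), and a fractional depth is handled by averaging the two neighbouring observations.

```python
import math

# Zinc levels, stable nutritional status, read from the stem-and-leaf display.
ZINC_STABLE = [119, 101, 104, 105, 106, 107, 109,
               90, 91, 94, 95, 95, 95, 95, 95,
               80, 81, 82, 83, 86, 88, 88, 89, 78, 79]


def value_at_depth(ordered, depth):
    """Observation at a 1-based depth; fractional depths average neighbours."""
    return (ordered[math.floor(depth) - 1] + ordered[math.ceil(depth) - 1]) / 2


def letter_values(values, tags=("M", "H", "E", "D")):
    """Return (tag, depth, lower, upper, mid, spread) rows, extremes last."""
    up = sorted(values)      # counting in from the low end
    down = up[::-1]          # counting in from the high end
    depth = (len(up) + 1) / 2
    rows = []
    for tag in tags:
        lower = value_at_depth(up, depth)
        upper = value_at_depth(down, depth)
        rows.append((tag, depth, lower, upper,
                     (lower + upper) / 2, upper - lower))
        depth = (math.floor(depth) + 1) / 2   # next depth: ([d] + 1) / 2
    rows.append(("1", 1, up[0], up[-1], (up[0] + up[-1]) / 2, up[-1] - up[0]))
    return rows


for row in letter_values(ZINC_STABLE):
    print(row)
```

For the median row, the lower and upper values coincide, so its spread is 0; the display on the slide simply lists the median once.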

Look at Skewness

      Bilirubin MID   Zinc-Stable MID   Zinc-Impaired MID
M         9.9             94                73
H         9.1             93.5              71
E        10.75            93.5              67.5
D        12.43            93.75             66.5
1        15.75            98.5              63

Bilirubin: mid-summaries increasing ===> skewed RIGHT
Zinc-Stable: not much of a trend, fairly symmetric
Zinc-Impaired: mid-summaries decreasing ===> slightly skewed LEFT

Choice of a Transformation

Ladder of Powers: X -> X^P

P       Transformation   Name              Naturals
...     ...
2       X^2              square
1       X                raw
1/2     √X               square root       counts
(0)     log X            logarithm         biochemical measures
-1/2    -1/√X            reciprocal root
-1      -1/X             reciprocal        waiting times (=> rates)
-2      -1/X^2
...     ...

Note: Use of a negative multiplier for p < 0 preserves the natural order.
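A minimal sketch of the ladder of powers in Python follows. The function name `reexpress` is hypothetical; it assumes positive data, uses the natural log at p = 0, and applies the negative multiplier for p < 0 so that larger raw values always map to larger re-expressed values.

```python
import math


def reexpress(x, p):
    """Re-express a positive value x using power p on the ladder of powers."""
    if x <= 0:
        raise ValueError("this sketch assumes x > 0")
    if p == 0:
        return math.log(x)       # the log takes the place of the zero power
    if p > 0:
        return x ** p
    return -(x ** p)             # negative multiplier keeps the natural order


ladder = (2, 1, 0.5, 0, -0.5, -1, -2)
data = (1, 2, 4, 8)

for p in ladder:
    print(p, [round(reexpress(x, p), 3) for x in data])
```

Every row of the printed output is in increasing order, which is the point of the sign change for negative powers.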

Effect of Transformation X -> X^p

P > 1:
  Pulls in a stretched-out lower tail
  Stretches out a bunched-in upper tail

P < 1:
  Pulls in a stretched-out upper tail
  Stretches out a bunched-in lower tail

Effect of Transformation: An Example (Bilirubin Data)

                   Mid-raw        Mid-√                        Mid-log (ln)
M                   9.9            3.15                         2.29
H                   9.1            2.87                         2.00
E                  10.75           2.67                         1.26
D                  12.43           2.77                         1.07
1                  15.75           3.02                         0.92
Shape              Skewed right    About right? (symmetric?)    Skewed left
Ladder of powers   p = 1           p = 1/2                      p = 0

The log seems to stretch out the lower tail too much!!
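The mid-summaries after re-expression can be approximated from the lower and upper letter values of the bilirubin display. This is a sketch under one stated assumption: each letter value is transformed directly, which is exact when the depth is a whole number and only approximate when the letter value is itself an average of two observations (as for H, E and D here). The names `BILIRUBIN_LV` and `mids` are made up for the illustration.

```python
import math

# (lower, upper) letter values for bilirubin, n = 95, from the display.
BILIRUBIN_LV = {
    "M": (9.9, 9.9),
    "H": (3.8, 14.4),
    "E": (0.6, 20.9),
    "D": (0.35, 24.5),
    "1": (0.2, 31.3),
}


def mids(transform, letter_values=BILIRUBIN_LV):
    """Mid-summary of the transformed lower and upper letter values."""
    return {tag: round((transform(lo) + transform(hi)) / 2, 2)
            for tag, (lo, hi) in letter_values.items()}


mid_sqrt = mids(math.sqrt)   # p = 1/2
mid_log = mids(math.log)     # p = 0 (natural log)

print("sqrt:", mid_sqrt)
print("log: ", mid_log)
```

The square-root mids show no steady trend, while the log mids fall steadily, matching the "about right" and "skewed left" readings in the table.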

Figure: Effect of Transformation for Bilirubin Data (panels: raw data, square root, log)

STARTS

Problem:
Can't take log x for x ≤ 0
Can't take even roots (√x, ⁴√x, ⁶√x, etc.) for x < 0

Some Solutions:
1. Use log (x + c) instead of log x (c is the "start"). c should be small compared to the typical size of the data values, e.g. log (x + 1/4), log (x + 1/2), log (x + 1).
2. If all x's are negative, it is easier and better to simply multiply by -1 first, then take logs or even roots.
3. If only some x's are negative, then adding a constant might be OK.
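A started log can be sketched as below. The function name `started_log`, the default start of 1/2, and the example counts are all assumptions chosen for illustration.

```python
import math


def started_log(x, start=0.5):
    """Natural log of x after adding a small positive start c."""
    if x + start <= 0:
        raise ValueError("start is too small for this value")
    return math.log(x + start)


# Hypothetical counts that include a zero, where log x alone would fail.
counts = [0, 1, 3, 10]
print([round(started_log(c), 3) for c in counts])
```

With a start that is small relative to the typical data value, the larger observations are barely affected while zeros become usable.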

Comparing to the Normal Distribution

After transforming a data set to a (roughly) symmetric shape, can the new distribution be compared to normality?
Yes: compare spreads to normal spreads.

Spreads for the N(0,1) distribution (see Velleman & Hoaglin for more):
Name   Spread
H      1.349
E      2.301
D      3.068

If the distribution is normal, then the quotients
  (H-spread) / 1.349
  (E-spread) / 2.301
  (D-spread) / 3.068
should be nearly equal.

If the quotients increase, then heavy tails. If the quotients decrease, then light tails.

Note: Can use (H-spread) / 1.349 as an estimate of σ.

Comparison to Normality: An Example

Since √Bilirubin is quite symmetric, we'll look at that.

√Bilirubin   Spread    s
M              -
H             1.85    1.37 (= 1.85 / 1.349)
E             3.80    1.65 (= 3.80 / 2.301)
D             4.36    1.42 (= 4.36 / 3.068)

Also, look at zinc-stable:

Zinc      M      H      E      D
Spread    -     15.0   25.0   28.5
s               11.1   10.9    9.3
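The normal reference spreads and the quotients can be reproduced with Python's standard library. This sketch assumes that the H, E and D letter values cut off tail areas of 1/4, 1/8 and 1/16, so each N(0,1) spread is twice the corresponding upper quantile. The zinc-stable spreads are those given in the letter-value display; the variable names are illustrative.

```python
from statistics import NormalDist

# Tail area cut off by each letter value (assumed: H = 1/4, E = 1/8, D = 1/16).
TAIL_AREAS = {"H": 1 / 4, "E": 1 / 8, "D": 1 / 16}

# Spread of N(0,1) between the lower and upper letter values.
NORMAL_SPREADS = {tag: 2 * NormalDist().inv_cdf(1 - tail)
                  for tag, tail in TAIL_AREAS.items()}

# Observed spreads for the zinc-stable group, from the letter-value display.
ZINC_STABLE_SPREADS = {"H": 15.0, "E": 25.0, "D": 28.5}

# Each quotient estimates sigma if the data are roughly normal.
quotients = {tag: ZINC_STABLE_SPREADS[tag] / NORMAL_SPREADS[tag]
             for tag in TAIL_AREAS}

for tag in TAIL_AREAS:
    print(tag, round(NORMAL_SPREADS[tag], 3), round(quotients[tag], 1))
```

The quotients for zinc-stable drift downward slightly, which by the rule above points toward somewhat light tails.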

Transformations Useful in Common Situations

A. AMOUNTS AND COUNTS
log x, x^(1/2), x^(-1)
Example: White blood counts, glucose levels, number of patients seen in clinic per month.
* log is especially useful if the ratio of the largest to smallest observation is large.

B. BALANCES (i.e., real numbers)
Often not transformed, but if necessary do it!!
Example: Deviation from ideal body weight

C. COUNTED FRACTIONS
i.e., $p = x/n$ or $p = (x - A)/(B - A)$
* Use folded values, with transform(p) = f(p) - f(1 - p) [symmetry is natural]
froots: $\sqrt{p} - \sqrt{1-p}$
flogs: $\mathrm{logit}(p) = \log\frac{p}{1-p} = \log p - \log(1-p)$
pluralities: $p - (1-p) = 2p - 1$
Example: proportion of patients responding to Rx; percentage of sperm with oval shape

D. RANKS (i.e., 1, 2, 3, ..., n)
Similar to fractions
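The folded transformations for counted fractions can be written out in a few lines of Python. The function names follow the slide's terms (froot, flog, plurality); the natural log is assumed for the flog, and p is assumed to lie strictly between 0 and 1 where a log is taken.

```python
import math


def froot(p):
    """Folded root: sqrt(p) - sqrt(1 - p)."""
    return math.sqrt(p) - math.sqrt(1 - p)


def flog(p):
    """Folded log (logit): log(p) - log(1 - p), for 0 < p < 1."""
    return math.log(p) - math.log(1 - p)


def plurality(p):
    """Folded fraction: p - (1 - p) = 2p - 1."""
    return p - (1 - p)


for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(froot(p), 3), round(flog(p), 3), plurality(p))
```

Each transform is 0 at p = 0.5 and changes sign when p is replaced by 1 - p, which is the symmetry the folding is meant to provide.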

Another Example

Duration of operation for 100 patients with Epidural Anesthesia (time recorded in minutes)

     DEPTH   LOWER   UPPER    MID     SPREAD
M    50.5        67.5         67.5
H    25.5     60       90     75        30
E    13       45      120     82.5      75
D     7       40      135     87.5      95
1     1       30      195    112.5     165

** Stretched-out upper tail suggests X^p with p < 1

Stem  Leaf                       #
  19  5                          1
  18  00                         2
  17
  16  0                          1
  15  0                          1
  14  0                          1
  13  55555                      5
  12  000000                     6
  11
  10  5555                       4
   9  000000000000000           15
   8  0                          1
   7  5555555555555             13
   6  00000000000000000000000   23
   5
   4  055555555555555555555     21
   3  000000                     6
      ----+----+----+----+---
Multiply Stem.Leaf by 10**+1

Since log (p = 0) is slightly skewed right and -100 / OPTIME (p = -1) is skewed left, a power between 0 and -1 might work.

Try p = -1/2, i.e., -100 / √OPTIME

p = -1/2: -100 / √OPTIME

     MID
M   -12.2
H   -11.7
E   -12.0
D   -12.2
1   -12.7

Stem  Leaf                       #
  -7  2                          1
  -7  955                        3
  -8  2                          1
  -8  666665                     6
  -9  111111                     6
  -9  8888                       4
 -10
 -10  555555555555555           15
 -11  2                          1
 -11  5555555555555             13
 -12
 -12  99999999999999999999999   23
 -13
 -13
 -14
 -14  99999999999999999999      20
 -15
 -15  8                          1
 -16
 -16
 -17
 -17
 -18  333333                     6
      ----+----+----+----+---
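The mid-summaries for p = -1/2 can be checked from the OPTIME letter values. This is a sketch: the lower and upper values come from the letter-value display for the raw data, except for the median, where the two middle observations are taken to be 60 and 75 minutes (an assumption read off the stem-and-leaf display, consistent with M = 67.5). The names `OPTIME_LV` and `reciprocal_root` are illustrative.

```python
import math

# (lower, upper) letter values for OPTIME in minutes, n = 100.
# The median pair (60, 75) is an assumption inferred from the stem-and-leaf.
OPTIME_LV = {
    "M": (60, 75),
    "H": (60, 90),
    "E": (45, 120),
    "D": (40, 135),
    "1": (30, 195),
}


def reciprocal_root(x):
    """Re-expression with p = -1/2, scaled as on the slide: -100 / sqrt(x)."""
    return -100 / math.sqrt(x)


mids = {tag: round((reciprocal_root(lo) + reciprocal_root(hi)) / 2, 1)
        for tag, (lo, hi) in OPTIME_LV.items()}

print(mids)
```

The mids stay close to -12 with no steady drift, which is what a roughly symmetric re-expression looks like.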

p = 0: log (OPTIME)

     MID
M    4.2
H    4.295
E    4.3
D    4.3
1    4.335

Stem  Leaf                       #
  52  7                          1
  51  99                         2
  50  18                         2
  49  111114                     6
  48
  47  999999                     6
  46  5555                       4
  45  000000000000000           15
  44
  43  22222222222228            14
  42
  41
  40  99999999999999999999999   23
  39
  38  11111111111111111111      20
  37
  36  9                          1
  35
  34  000000                     6
      ----+----+----+----+---
Multiply Stem.Leaf by 10**-1

Pretty good!!?

p = -1: -100 / OPTIME

     MID
M   -1.48
H   -1.39
E   -1.53
D   -1.62
1   -1.92

Stem  Leaf                       #
  -4  661                        3
  -6  44444172                   8
  -8  5555333333                10
 -10  111111111111111           15
 -12  33333333333335            14
 -14
 -16  77777777777777777777777   23
 -18
 -20
 -22  22222222222222222222      20
 -24  0                          1
 -26
 -28
 -30
 -32  333333                     6
      ----+----+----+----+---
Multiply Stem.Leaf by 10**-1

Now skewed to the low end

p = 1/2: √OPTIME

     MID
M    8.22
H    8.62
E    8.83
D    8.97
1    9.72

Stem  Leaf                       #
  14  0                          1
  13
  13  44                         2
  12  6                          1
  12  2                          1
  11  666668                     6
  11  000000                     6
  10
  10  2222                       4
   9  555555555555555           15
   9
   8  77777777777779            14
   8
   7  77777777777777777777777   23
   7
   6  77777777777777777777      20
   6  3                          1
   5  555555                     6
   5
      ----+----+----+----+---

Less skewness, but it still exists

Comparing OPTIME Spreads to Normal Distribution

Standardized Spread
     OPTIME   √OPTIME   log OPTIME   -100/OPTIME   -100/√OPTIME
H     22.2     1.29        .30           .64           1.78
E     32.6     1.84        .43           .60           2.52
D     30.9     1.73        .40           .57           2.35

about right?

Figure: Graphical Comparison of OPTIME Spreads to Normal Distribution (panels: OPTIME, √OPTIME, log(OPTIME), -100/OPTIME, -100/√OPTIME)

Example 4: Peak Common Bile Duct Pressure

During an operation, common bile duct pressure is measured every 2 minutes for 20 minutes. The ratio of pressure at time t to baseline (t = 0) is calculated. The peak ratio is recorded.

Peak Ratio

     MID     SPR    STD SPR
M    1.94     -
H    1.90    1.00
E    2.15    1.79
D    2.10    1.80
1    2.33    2.64

Stem  Leaf         #
  36  5            1
  34  33           2
  32
  30  0004         4
  28  0            1
  26  0            1
  24  000034       6
  22  369          3
  20  0011489      7
  18  08           2
  16  0055789      7
  14  00034        5
  12  005555689    9
  10  19           2
      ----+----+----+----+
Multiply Stem.Leaf by 10**-1

-10 / √(Peak Ratio)

     MID      SPR    STD SPR
M   -7.18      -       -
H   -7.45     2.00    1.48
E   -7.34     3.20    1.39
D   -7.45     3.36    1.10
1   -7.59     4.72     -

Stem  Leaf          #
  -5  442           3
  -5  8887          4
  -6  4420          4
  -6  997765555     9
  -7  311110        6
  -7  99887775      8
  -8  43            2
  -8  9999988555   10
  -9  11            2
  -9  6             1
 -10  0             1
      ----+----+----+----+

A look ahead...

Variance stabilization
Straightening x-y plots
Interpretation and reporting