Chapter 2: Descriptive Statistics
Lecture 1: Measures of Central Tendency and Dispersion

Donald E. Mercante, PhD
Biostatistics
May 2010

Biostatistics (LSUHSC) Chapter 2 05/10
Lecture 1: Descriptive Statistics

We begin with a discussion of descriptive statistics, which will be followed later in the course by inferential statistics.

Descriptive statistics generally fall into one of two categories:
1. Measures of Location or Central Tendency
2. Measures of Dispersion or Variability

Measures of Location
- Arithmetic Mean
- Median
- Mode
- Geometric Mean
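Arithmetic Mean

- Uses all of the data in the sample
- Susceptible to extreme values (outliers)
- Generally the preferred measure of location for continuous data

$$\text{Mean} = \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

Median

- Easily determined
- Uses at most two observations and the order of the data
- Resistant to extreme values

$$\text{Median} = \tilde{X} = \begin{cases} \left(\frac{N+1}{2}\right)\text{th largest observation} & \text{if } N \text{ is odd} \\ \text{average of the } \left(\frac{N}{2}\right)\text{th and } \left(\frac{N}{2}+1\right)\text{th largest observations} & \text{if } N \text{ is even} \end{cases}$$

Here the "kth largest" observation is the kth value when the data are sorted in ascending order, i.e., the median is the middle value if N is odd, or the average of the two middle values if N is even.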
Mode

Mode = most frequently occurring value(s) in the data set.

- Easily determined
- Not unique
- Uses very little of the data

Data Set: Sample of Birthweights for 20 newborns (Table 2.1)

 i   x_i     i   x_i     i   x_i     i   x_i
 1  3265     6  3323    11  2581    16  2759
 2  3260     7  3649    12  2841    17  3248
 3  3245     8  3200    13  3609    18  3314
 4  3484     9  3031    14  2838    19  3101
 5  4146    10  2069    15  3541    20  2834
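Base R has no built-in function for the statistical mode (the function mode() reports an object's storage type instead). The following is a minimal sketch, assuming a hypothetical helper named sample_mode that tabulates the data and returns every value with the highest count.

```r
# Hypothetical helper (not part of base R): return all most-frequent values.
sample_mode <- function(x) {
  counts <- table(x)                              # frequency of each distinct value
  modes  <- names(counts)[counts == max(counts)]  # all values tied for the top count
  as.numeric(modes)                               # table() stores the values as names
}

# Small made-up vector with a single mode
sample_mode(c(2, 3, 3, 5, 7))

# Birthweights from Table 2.1: every value occurs exactly once
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)
length(sample_mode(birthwgt))   # number of values tied as "the mode"
```

Because all 20 birthweights are distinct, every observation ties as a mode, which illustrates why the mode is not unique and is of limited use for continuous measurements.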
Data Summaries

Using R to calculate descriptives:

> birthwgt
 [1] 3265 3260 3245 3484 4146 3323 3649 3200 3031 2069 2581 2841 3609 2838 3541
[16] 2759 3248 3314 3101 2834

Data set sorted in ascending order:

> sort(birthwgt)
 [1] 2069 2581 2759 2834 2838 2841 3031 3101 3200 3245 3248 3260 3265 3314 3323
[16] 3484 3541 3609 3649 4146

> summary(birthwgt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2069    2840    3246    3167    3363    4146
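As a check on the formulas for the mean and median, the same quantities can be computed step by step. This is a sketch only; the variable names (N, xbar, x_sorted, med) are chosen for illustration.

```r
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)
N <- length(birthwgt)

# Arithmetic mean: sum of the observations divided by N
xbar <- sum(birthwgt) / N

# Median: N = 20 is even, so average the (N/2)th and (N/2 + 1)th ordered values
x_sorted <- sort(birthwgt)
med <- (x_sorted[N / 2] + x_sorted[N / 2 + 1]) / 2

c(mean = xbar, median = med)          # 3166.9 and 3246.5
c(mean(birthwgt), median(birthwgt))   # the built-in functions agree
```

The summary() output shows these values rounded for display, which is why it reports 3167 and 3246.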
Geometric Mean

Particularly useful for right-skewed data (e.g., serial dilutions), where a log transformation improves the symmetry of the distribution.

Calculation:

$$\overline{\log x} = \frac{1}{N}\sum_{i=1}^{N} \log(x_i)$$

$$\text{Geometric Mean} = \operatorname{antilog}\left(\overline{\log x}\right)$$

If the log was taken base 10, then the antilog is $10^{\overline{\log x}}$.
If the log was taken base e, then the antilog is $e^{\overline{\log x}}$.

R-Code:

> exp(mean(log(birthwgt)))
[1] 3135.317
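The choice of log base does not affect the result, provided the matching antilog is used. The sketch below checks this on the birthweight data and also compares against the Nth root of the product of the observations, which is the direct definition of the geometric mean. The variable names are illustrative only.

```r
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)

gm_e    <- exp(mean(log(birthwgt)))      # natural log, antilog is exp()
gm_10   <- 10^mean(log10(birthwgt))      # base-10 log, antilog is 10^()
gm_prod <- prod(birthwgt)^(1 / length(birthwgt))  # Nth root of the product

c(gm_e, gm_10, gm_prod)   # all three agree up to rounding error
gm_e < mean(birthwgt)     # the geometric mean does not exceed the arithmetic mean
```

The log-based forms are usually preferred in practice because the raw product can overflow for large samples.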
Symmetry in Distribution

If the distribution of the data is symmetric and unimodal, then the mean, median, and mode coincide. In particular, we will see this is true for data that follow a normal distribution.

If the data distribution is skewed, the median is the preferred measure of location.
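A small made-up example shows why the median is preferred for skewed data: one extreme value pulls the mean strongly but leaves the median unchanged. The vectors x_sym and x_skew are invented for illustration and are not from the course data.

```r
x_sym  <- c(1, 2, 3, 4, 5)     # symmetric made-up sample
x_skew <- c(1, 2, 3, 4, 50)    # same sample with one extreme value

c(mean = mean(x_sym),  median = median(x_sym))    # 3 and 3
c(mean = mean(x_skew), median = median(x_skew))   # 12 and 3
```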
Measures of Spread or Variability

- Range
- Quantiles/Percentiles
- Variance / Standard Deviation
- Coefficient of Variation

Range = R = Max - Min

Note on Calculating Percentiles and Quantiles:

With the n observations sorted in ascending order and p the percentile expressed as a proportion (e.g., p = 0.25 for the 25th percentile), the percentile is found as:
- the average of the (np)th and (np+1)th values, if np is an integer;
- the (k+1)th value, where k is the largest integer contained in np, if np is not an integer.
Calculating Percentiles: Example

Let the sample size be N = 200. Calculate the 25th, 50th, and 75th percentiles. In each case Np is an integer.

25th percentile (p = 0.25): Np = 200(0.25) = 50. Since Np is an integer, the 25th percentile is the average of the 50th and 51st observations, counting from the smallest observation. That is, it is the average of the (Np)th and (Np+1)th observations.
50th percentile (p = 0.50), N = 200: Np = 200(0.5) = 100. Since Np is an integer, the 50th percentile (median) is the average of the 100th and 101st observations, counting from the smallest value.
75th percentile (p = 0.75), N = 200: Np = 200(0.75) = 150. Since Np is an integer, the 75th percentile is the average of the 150th and 151st observations, counting from the smallest value.
Calculating Percentiles: Example

Let the sample size be N = 35. Calculate the 25th, 50th, and 75th percentiles. In each case Np is not an integer.

25th percentile (p = 0.25): Np = 35(0.25) = 8.75. Since Np is not an integer, the 25th percentile is found at position 1 + the largest integer contained in Np. Here the largest integer contained in 8.75 is 8, and 8 + 1 = 9, so the 25th percentile is the 9th value counting from the smallest value.
50th percentile (p = 0.50), N = 35: Np = 35(0.5) = 17.5, which is not an integer. The largest integer contained in 17.5 is 17, and 17 + 1 = 18, so the 50th percentile is the 18th value counting from the smallest value.
75th percentile (p = 0.75), N = 35: Np = 35(0.75) = 26.25, which is not an integer. The largest integer contained in 26.25 is 26, and 26 + 1 = 27, so the 75th percentile is the 27th value counting from the smallest value.
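The percentile rule used in these examples can be sketched as a small R function. The name pctl is hypothetical, and the tolerance used to decide whether Np is an integer is an assumption made to guard against floating-point error. Applying it to the integers 1, ..., N makes the returned value equal to the position in the sorted data, so the results can be read directly as ranks.

```r
# Hypothetical helper implementing the rule from the examples (0 < p < 1).
pctl <- function(x, p) {
  stopifnot(p > 0, p < 1)
  x  <- sort(x)
  np <- length(x) * p
  if (abs(np - round(np)) < 1e-8) {   # Np is (numerically) an integer
    k <- round(np)
    (x[k] + x[k + 1]) / 2             # average of the (Np)th and (Np+1)th values
  } else {
    x[floor(np) + 1]                  # 1 + largest integer contained in Np
  }
}

p <- c(0.25, 0.50, 0.75)
sapply(p, function(q) pctl(1:200, q))   # 50.5 100.5 150.5
sapply(p, function(q) pctl(1:35,  q))   # 9 18 27

# Built-in alternative that should agree with this rule
quantile(1:35, p, type = 2)
```

R's quantile() offers several algorithms. Its default (type = 7) interpolates between order statistics and so can give slightly different answers, whereas type = 2 averages at the discontinuities in the way described here.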
Sample Variance

The variance and its square root, the standard deviation, are the most widely used measures of variability.

- All of the data are used.
- The mean is used as a reference point.
- Only non-negative values are possible.

Definitional form:

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$$

Computational form:

$$S^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} X_i^2 - N\bar{X}^2\right)$$

The standard deviation is the square root of the variance:

$$S = \sqrt{S^2}$$
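The two forms of the variance are algebraically identical. Expanding the square and using $\sum_{i=1}^{N} X_i = N\bar{X}$ gives:

```latex
\begin{align*}
\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2
  &= \sum_{i=1}^{N} X_i^2 - 2\bar{X}\sum_{i=1}^{N} X_i + N\bar{X}^2 \\
  &= \sum_{i=1}^{N} X_i^2 - 2N\bar{X}^2 + N\bar{X}^2 \\
  &= \sum_{i=1}^{N} X_i^2 - N\bar{X}^2
\end{align*}
```

Dividing both sides by N - 1 yields the computational form.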
Sample Variance

Computing the sample variance and standard deviation using R:

> data<-read.table("c:\\table2_1.txt",header=TRUE)
> attach(data)
> var(birthwgt)
[1] 198323.6
> sd(birthwgt)
[1] 445.3353

Using the computational form, $S^2 = \frac{1}{N-1}\left(\sum X_i^2 - N\bar{X}^2\right)$:

> (1/(length(birthwgt)-1))*(sum(birthwgt^2)-
+ length(birthwgt)*mean(birthwgt)^2)
[1] 198323.6
Computing the sample variance using R

$\frac{1}{N-1}$:
> nm1<-1/(length(birthwgt)-1)
> nm1
[1] 0.05263158

$\sum X_i^2$:
> sum.x2<-sum(birthwgt^2)
> sum.x2
[1] 204353260

$N$:
> n<-length(birthwgt)
> n
[1] 20

$\bar{X}^2$:
> xbar2<-mean(birthwgt)^2
> xbar2
[1] 10029256

Computing the sample variance, $S^2 = \frac{1}{N-1}\left(\sum X_i^2 - N\bar{X}^2\right)$:
> nm1*(sum.x2-n*xbar2)
[1] 198323.6
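For comparison, the definitional form can be computed the same way. The sketch below (variable names are illustrative) evaluates both forms on the birthweight data and checks them against var().

```r
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)
N <- length(birthwgt)

# Definitional form: squared deviations from the mean
dev    <- birthwgt - mean(birthwgt)
s2_def <- sum(dev^2) / (N - 1)

# Computational form: sum of squares minus N times the squared mean
s2_comp <- (sum(birthwgt^2) - N * mean(birthwgt)^2) / (N - 1)

c(definitional = s2_def, computational = s2_comp, builtin = var(birthwgt))
sqrt(s2_def)   # standard deviation, matches sd(birthwgt)
```

The computational form subtracts two large numbers, so on data with a large mean and small spread it can lose numerical precision; the definitional form is generally the safer choice when computing by machine.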
Coefficient of Variation (CV)

The coefficient of variation is a unitless measure of variability that is the ratio of the standard deviation to the mean.

- Useful for comparing the variability of datasets measured in different units
- Only useful for data on a ratio scale of measurement

$$CV = \frac{S}{\bar{X}} \times 100\%$$

R-Code for computing the C.V.:

> 100*sd(birthwgt)/mean(birthwgt)
[1] 14.06219
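The sketch below illustrates both properties of the CV. It assumes, for illustration only, that the birthweights are recorded in grams; the helper name cv is hypothetical. Rescaling the data (grams to kilograms) leaves the CV unchanged, while adding a constant changes it, which is why the CV is meaningful only on a ratio scale with a true zero.

```r
birthwgt <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
              2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)

# Hypothetical helper: coefficient of variation in percent
cv <- function(x) 100 * sd(x) / mean(x)

cv(birthwgt)          # original units (assumed grams)
cv(birthwgt / 1000)   # rescaled (kilograms under that assumption): same CV
cv(birthwgt + 1000)   # shifted by a constant: the CV changes
```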
Graphics: Scatter Plots

R-Code:

> plot(fwtright,fwtleft,main="Scatter Plot")
Stem and Leaf Plots

Constructed from an ordered array of the data.

> sort(birthwgt)
 [1] 2069 2581 2759 2834 2838 2841 3031 3101 3200 3245 3248 3260 3265 3314 3323
[16] 3484 3541 3609 3649 4146

> stem(birthwgt)

  The decimal point is 3 digit(s) to the right of the |

  2 | 1
  2 | 68888
  3 | 012223333
  3 | 5566
  4 | 1
Stem and Leaf Plots

> stem(rnorm(n=200,mean=2.5,sd=.5))

  The decimal point is 1 digit(s) to the left of the |

  12 | 33
  14 | 035
  16 | 00567067788899
  18 | 00144668901123333366
  20 | 011244677890134445566778889
  22 | 0000023444668890023344556677788999
  24 | 1112226777888912335567
  26 | 0000011133555679999124446668
  28 | 001114477801222222556667
  30 | 67123589
  32 | 0034673448
  34 | 2658
  36 | 356
[Figure: Box Plots]
Side by Side Box Plots

Data set Lead.txt from Rosner's CD.

R-code:

boxplot(fwt_r~sex)
Box Plots

Based on quartiles of the sample data: the 25th (Q1), 50th (Q2), and 75th (Q3) percentiles.

Step 1: Draw a number line scale encompassing the range of the data.
Step 2: Compute quartiles Q1, Q2, and Q3 (see section on calculating percentiles).
Step 3: Draw a box above the number line from Q1 to Q3.
Step 4: Draw a vertical hash within the box at Q2.
Step 5: Determine outliers as points further than 1.5*(Q3-Q1) from the ends of the box.
Step 6: Extend "whiskers" to the largest (smallest) observations that are not outliers.
Step 7: Draw small circles to represent outliers.
Box Plots: Example

We will construct a box plot from a sample of n=10 observations taken as a random sample from a larger data set containing n=100 observations.

y <- sample(y2,10)
sort(y)
2.000298 2.179386 2.236754 2.249943 2.342899 2.465442 2.596121 2.741588 2.772203 2.834250

summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.240   2.404   2.442   2.705   2.834

par(plt=c(0.2,0.5,.6,.9))
boxplot(y,main="Box Plot of y",xlab="y")
Box Plots

The quartiles were easily obtained using R on the previous slide. Alternatively, we could use the method of computing percentiles on the sorted data:

2.000298 2.179386 2.236754 2.249943 2.342899 2.465442 2.596121 2.741588 2.772203 2.834250

Method of calculating percentiles:

25th percentile: np = 10(0.25) = 2.5. Since np is not an integer, add one to the largest integer in 2.5: 2 + 1 = 3. The 25th percentile (Q1) is the 3rd observation from the left (when sorted in ascending order) = 2.236754.

Likewise, np = 10(0.75) = 7.5, so Q3 is the 8th observation from the left, i.e., the 3rd observation from the right end = 2.741588.

To determine the median (Q2): np = 10(0.5) = 5. Since np is an integer, Q2 is the average of the (np)th and (np+1)th observations = (2.342899 + 2.465442)/2 = 2.4042.

Q1 and Q3 differ slightly from the values reported by summary() (2.240 and 2.705) because R's default quantile method interpolates between adjacent observations.
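The quantities needed to draw the box plot by hand can be computed from these ten values. This is a sketch under stated assumptions: quantile(..., type = 2) is used because it should reproduce the percentile rule applied above, and the variable names (iqr, lower_fence, upper_fence, whiskers) are illustrative.

```r
y <- c(2.000298, 2.179386, 2.236754, 2.249943, 2.342899,
       2.465442, 2.596121, 2.741588, 2.772203, 2.834250)

# Quartiles by the averaging rule (type = 2), not R's default interpolation
q  <- unname(quantile(y, c(0.25, 0.50, 0.75), type = 2))
q1 <- q[1]; q2 <- q[2]; q3 <- q[3]

# Outlier limits: 1.5 * (Q3 - Q1) beyond the ends of the box
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- y[y < lower_fence | y > upper_fence]
whiskers <- range(y[y >= lower_fence & y <= upper_fence])

c(Q1 = q1, Q2 = q2, Q3 = q3)
outliers    # empty: no point lies beyond the fences
whiskers    # whiskers extend to the sample minimum and maximum
```

Note that boxplot() computes its box edges with its own hinge definition (see boxplot.stats), so the plotted box need not match hand-computed quartiles exactly in every data set.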
[Figure: Box Plots]
[Figure: Graphics: Histograms]
[Figure: Histogram with Normal Distribution Curve]
Histograms

Based on the frequency distribution obtained by categorizing a continuous variable.

Step 1: Create categorical ranges (bins) of equal size by dividing the range of the data by the number of bins.
Step 2: Obtain the frequency distribution of the number of observations per bin.
Step 3: Plot the histogram with the height of each rectangle proportional to the bin frequency.

R-Code:

hist(y)

which can be embellished with titles and axis labels:

hist(y,main="Histogram of y",xlab="y")
Histogram Frequency Table

Category (bin)   Frequency
2.0-2.2          2
2.2-2.4          3
2.4-2.6          2
2.6-2.8          2
2.8-3.0          1
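The frequency table can be reproduced from the ten sampled values of y listed earlier. The sketch below assumes bin edges at 2.0, 2.2, ..., 3.0, as in the table, and uses cut() and table() for the counts; hist() with plot = FALSE returns the same counts without drawing.

```r
y <- c(2.000298, 2.179386, 2.236754, 2.249943, 2.342899,
       2.465442, 2.596121, 2.741588, 2.772203, 2.834250)

breaks <- seq(2.0, 3.0, by = 0.2)               # equal-width bin edges

# Step 2: frequency distribution (bins closed on the left, open on the right)
bins <- cut(y, breaks = breaks, right = FALSE)
freq <- table(bins)
freq

# Same counts from hist(), without drawing the plot
h <- hist(y, breaks = breaks, plot = FALSE)
h$counts                                        # 2 3 2 2 1
```

No observation falls exactly on a bin edge here, so the choice of which side of each bin is closed does not affect the counts.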
[Figure: Histogram]
[Figure: Data Graphics]
R Code for Generating 4-Panel Graphics

# Read the birthweight data and make its columns available by name
data<-read.table("c:\\table2_1.txt",header=TRUE)
attach(data)
par(plt=c(0,0.5,.5,1.0))
par(mfrow=c(2,2))
# Top-left panel: index plot of the data
par(fig=c(0.05,.25,.8,.95))
plot(birthwgt,xlab="")
# Top-right panel: box plot
par(fig=c(.25,.45,.8,.95),new=TRUE)
boxplot(birthwgt,xlab="")
# Bottom-left panel: histogram
par(fig=c(0.05,.25,.6,.8),new=TRUE)
hist(birthwgt,main="",xlab="")
# Bottom-right panel: normal Q-Q plot with reference line
par(fig=c(.25,.45,.6,.8),new=TRUE)
qqnorm(birthwgt,xlab="")
qqline(birthwgt,lty=2)