Chapter 1 Descriptive Statistics

MICHIGAN STATE UNIVERSITY
STT 351 SECTION 2, FALL 2008
LECTURE NOTES
Chapter 1 Descriptive Statistics
Nao Mimoto

Contents

1 Overview
2 Pictorial Methods in Descriptive Statistics
  2.1 Different Kinds of Plots
  2.2 How to draw Stem-and-leaf plot, dot plot and histogram
  2.3 Shapes of histogram
3 Measures of Location
  3.1 Mean and Median
  3.2 Quantiles, percentiles
4 Measures of Variability
  4.1 Sample Variance
  4.2 Five number summary and Boxplots

Lecture notes for Devore 7ed. Chapter 1

1 Overview

Population: our body of interest.
Sample: a subset of the population chosen in some random manner.
Data: a collection of facts, numbers, and measurements.
  Univariate, bivariate, and multivariate data.
  Discrete and continuous variables.
Inferential Statistics: generalizes the information gained from a sample to a population.
Descriptive Statistics: summarizes and describes important features of the data.
  Stem-and-leaf plot
  Dotplot
  Scatter plot
  Histograms
  Boxplots
  Mean
  Median
  Quantiles, percentiles, trimmed means
  Outlier
  Sample variance

2 Pictorial Methods in Descriptive Statistics

2.1 Different Kinds of Plots

(Example 1.2 from p.5) Material strength investigations. Flexural strength of high performance concrete (in megapascals):

5.9 7.2 7.3 6.3 8.1 6.8 7.0 7.6 6.8 6.5 7.0 6.3 7.9 9.0 8.2 8.7 7.8 9.7 7.4 7.7 9.7 7.8 7.7 11.6 11.3 11.8 10.7

Population:
Sample:
Data: Univariate, continuous variable.

Stem-and-leaf plot

 5 | 9
 6 | 33588
 7 | 00234677889
 8 | 127
 9 | 077
10 | 7
11 | 368

Dotplot

Scatter plot (Chapter 12)

Histogram

Box Plot

2.2 How to draw Stem-and-leaf plot, dot plot and histogram

Raw Data:

5.9 7.2 7.3 6.3 8.1 6.8 7.0 7.6 6.8 6.5 7.0 6.3 7.9 9.0 8.2 8.7 7.8 9.7 7.4 7.7 9.7 7.8 7.7 11.6 11.3 11.8 10.7

1. Sort the data

Sorted Data:

5.9 6.3 6.3 6.5 6.8 6.8 7.0 7.0 7.2 7.3 7.4 7.6 7.7 7.7 7.8 7.8 7.9 8.1 8.2 8.7 9.0 9.7 9.7 10.7 11.3 11.6 11.8

2. Decide on class intervals (bin width), group accordingly

5.9
6.3 6.3 6.5 6.8 6.8
7.0 7.0 7.2 7.3 7.4 7.6 7.7 7.7 7.8 7.8 7.9
8.1 8.2 8.7
9.0 9.7 9.7
10.7
11.3 11.6 11.8

3. Format them

 5 | 9
 6 | 33588
 7 | 00234677889
 8 | 127
 9 | 077
10 | 7
11 | 368

(The idea is the same for a dot plot or a histogram.)
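The three steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function name stem_and_leaf is made up here, and it assumes every value has exactly one decimal digit, so that the integer part is the stem and the tenths digit is the leaf.

```python
from collections import defaultdict

# Flexural strength data (MPa) from Section 2.1.
data = [5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0,
        8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7]


def stem_and_leaf(values):
    """Return {stem: leaves}; hypothetical helper assuming one decimal digit."""
    groups = defaultdict(str)
    for x in sorted(values):             # step 1: sort the data
        tenths = round(x * 10)           # 7.2 -> 72 (round guards against float error)
        stem, leaf = divmod(tenths, 10)  # step 2: group by stem
        groups[stem] += str(leaf)
    return dict(groups)


plot = stem_and_leaf(data)
for stem in range(min(plot), max(plot) + 1):  # step 3: format
    print(f"{stem:>2} | {plot.get(stem, '')}")
```

Looping over every stem between the smallest and largest keeps empty stems visible, which matters when the plot is read as a sideways histogram.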

Example (Problem 1-13 on p.20) Tensile ultimate strength (ksi) of metallic aerospace vehicles

122.2 124.2 124.3 125.6 126.3 126.5 126.5 127.2 127.3 127.5 127.9 128.6 128.8 129.0 129.2 129.4 129.6 130.2 130.4 130.8 131.8 131.4 131.4 131.5 131.6 131.6 131.8 131.8 132.3 132.4 132.4 132.5 132.5 132.5 132.5 132.6 132.7 132.9 133.0 133.1 133.1 133.1 133.1 133.2 133.2 133.2 133.3 133.3 133.5 133.5 133.5 133.8 133.9 134.0 134.0 134.0 134.0 134.1 134.2 134.3 134.4 134.4 134.6 134.7 134.7 134.7 134.8 134.8 134.8 134.9 134.9 135.2 135.2 135.2 135.3 135.3 135.4 135.5 135.5 135.6 135.6 135.7 135.8 135.8 135.8 135.8 135.8 135.9 135.9 135.9 135.9 136.0 136.0 136.1 136.2 136.2 136.3 136.4 136.4 136.6 136.8 136.9 136.9 137.0 137.1 137.2 137.6 137.6 137.8 137.8 137.8 137.9 137.9 138.2 138.2 138.3 138.3 138.4 138.4 138.4 138.5 138.5 138.6 138.7 138.7 139.0 139.1 139.5 139.6 139.8 139.8 140.0 140.0 140.7 140.7 140.9 140.9 141.2 141.4 141.5 141.6 142.9 143.4 143.5 143.6 143.8 143.8 143.9 144.1 144.5 144.5 147.7 147.7

1. Sort the data: This data is already sorted.

2. Decide on bin width, group accordingly: Let's say we divide the range 122 to 148 into class intervals of equal width 2. We then have 13 intervals. To get the relative frequency, we divide each frequency by the total number of observations, 153.

\[ \text{relative frequency} = \frac{\text{frequency}}{\text{total number of observations}} \]

class interval   frequency   relative frequency
122 - <124            1          0.0065
124 - <126            3          0.0196
126 - <128            7          0.0458
128 - <130            6          0.0392
130 - <132           11          0.0719
132 - <134           29          0.1895
134 - <136           36          0.2353
136 - <138           20          0.1307
138 - <140           20          0.1307
140 - <142            8          0.0523
142 - <144            7          0.0458
144 - <146            3          0.0196
146 - <148            2          0.0131
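As a quick check of the relative frequency column, the division can be reproduced in Python. The sketch below starts from the frequencies in the table rather than from the raw data; the variable names are chosen here for illustration only.

```python
# Class intervals [lower, lower + 2) and frequencies copied from the table above.
lower_bounds = list(range(122, 148, 2))
frequencies = [1, 3, 7, 6, 11, 29, 36, 20, 20, 8, 7, 3, 2]

n = sum(frequencies)  # total number of observations
relative = [f / n for f in frequencies]

for lo, f, r in zip(lower_bounds, frequencies, relative):
    print(f"{lo} - <{lo + 2}  {f:>3}  {r:.4f}")

print("total:", n, " sum of relative frequencies:", round(sum(relative), 4))
```

The relative frequencies always add up to 1, which is a handy sanity check after grouping.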

3. Format them: You can draw your histogram using either frequency or relative frequency. The two histograms below are the same except for the scale on the Y-axis.

4. If you use different class intervals: All three of these histograms are drawn using the same data. Note how the choice of class interval affects the shape of the histogram.
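The effect of the class interval can also be seen without drawing anything. The sketch below is an illustration only: to stay short it uses the 27 flexural strength values from Section 2.1 as a stand-in for the tensile strength data, and bin_counts is a hypothetical helper written for this example.

```python
# Flexural strength data (MPa) from Section 2.1, used here as a small stand-in.
flexural = [5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0,
            8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7]


def bin_counts(values, start, width, n_bins):
    """Count the values in each interval [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for x in values:
        k = int((x - start) // width)  # index of the interval containing x
        counts[k] += 1
    return counts


# Same data, two different bin widths, printed as text histograms.
for width, n_bins in [(1, 7), (2, 4)]:
    counts = bin_counts(flexural, start=5, width=width, n_bins=n_bins)
    print(f"bin width {width}:")
    for k, c in enumerate(counts):
        lo = 5 + k * width
        print(f"  {lo:>2} - <{lo + width:<2} {'*' * c}")
```

With width 1 the small cluster above 11 shows up as a separate bump; with width 2 most of that detail is smoothed away.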

2.3 Shapes of histogram

There are names to describe the general shape of a histogram: unimodal, multimodal, symmetric, positively skewed, negatively skewed.

3 Measures of Location

3.1 Mean and Median

For a sample of size n, $x_1, x_2, x_3, \ldots, x_n$, we wish to represent the location of the data by one simple number. We can use the sample mean, which is just the average of the observations:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}. \]

Or we can use the sample median, which is the middle observation:

\[ \tilde{x} = \begin{cases} \left(\frac{n+1}{2}\right)^{\text{th}} \text{ ordered observation} & \text{if } n \text{ is odd} \\ \text{average of the } \left(\frac{n}{2}\right)^{\text{th}} \text{ and } \left(\frac{n}{2}+1\right)^{\text{th}} \text{ ordered observations} & \text{if } n \text{ is even} \end{cases} \]

That is, if n = 9, then $\tilde{x}$ is the 5th ordered observation. If n = 10, then the sample median is the average of the 5th and 6th ordered observations.

Why are there both a mean and a median? One reason is that the mean is very sensitive to outliers. In other words, just one big number can change the mean by a lot. On the other hand, the median is insensitive to outliers. Another reason is that the mean is sometimes not a good measure of the average or middle observation in the data. Below is an example of that.

Example (Problem 1-27 on p.24) Study on the life distribution of microdrills. Number of holes that a drill machines before it breaks.

11 14 20 23 31 36 39 44 47 50 59 61 65 67 68 71 74 76 78 79 81 84 85 89 91 93 96 99 101 104 105 105 112 118 123 136 139 141 148 158 161 168 184 206 248 263 289 322 388 513

So we have

n = 50
$\bar{x}$ = 119.26
$\tilde{x}$ = average of the 25th and 26th ordered observations = (91 + 93)/2 = 92

1. Mean is sensitive to outliers

Imagine somebody typed 5013 instead of 513 by mistake. Now your mean is 209.26, but the median remains unchanged.

2. Mean is not always an average guy

In some cases, it may be somewhat misleading to use the mean as the one number that represents your data. According to our data, only 16 drills out of 50 drilled more than the mean of 119.26 holes. On the other hand, by definition, half of our sampled drills machined fewer than the median of 92 holes, and half of them drilled more than 92.
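The numbers in this example are easy to reproduce with Python's standard statistics module. The sketch below is only a check of the arithmetic; the variable names holes and typo are chosen here for illustration.

```python
from statistics import mean, median

# Microdrill data: number of holes machined before the drill breaks.
holes = [11, 14, 20, 23, 31, 36, 39, 44, 47, 50, 59, 61, 65, 67, 68, 71, 74,
         76, 78, 79, 81, 84, 85, 89, 91, 93, 96, 99, 101, 104, 105, 105, 112,
         118, 123, 136, 139, 141, 148, 158, 161, 168, 184, 206, 248, 263, 289,
         322, 388, 513]

print("n      =", len(holes))
print("mean   =", mean(holes))
print("median =", median(holes))

# 1. Sensitivity to an outlier: replace the last value 513 by the typo 5013.
typo = holes[:-1] + [5013]
print("mean with typo   =", mean(typo))
print("median with typo =", median(typo))

# 2. How many drills actually exceeded the mean?
above_mean = sum(x > mean(holes) for x in holes)
print("drills above the mean:", above_mean, "out of", len(holes))
```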

3.2 Quantiles, percentiles

The 1st quartile is the median of the smaller half of the data. Include the median in the half if n is odd. The 1st quartile is also called the lower fourth, or the 25th percentile.

The 2nd quartile is the same as the median. The median is also called the 50th percentile.

The 3rd quartile is the median of the larger half of the data. Include the median in the half if n is odd. The 3rd quartile is also called the upper fourth, or the 75th percentile.

Example

If the data looks like

1 2 3 4 5 6 7 8 9 10 11 12,

with 12 observations, the median is 6.5. Now we break the data into two halves and get

smaller half: 1 2 3 4 5 6
larger half:  7 8 9 10 11 12

The 1st quartile is the median of the smaller half, which is 3.5. The 3rd quartile is the median of the larger half, which is 9.5.

Example

If the data looks like

1 2 3 4 5 6 7 8 9 10 11 12 13,

with 13 observations, the median is 7. Since we have an odd number of observations, we include 7 in both the smaller half and the larger half.

smaller half: 1 2 3 4 5 6 7
larger half:  7 8 9 10 11 12 13

Then the 1st quartile is the median of the smaller half, which is 4. The 3rd quartile is the median of the larger half, which is 10.
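The median-of-halves rule used in these two examples can be written as a short Python function. This is a sketch of the rule as stated in these notes; the name fourths is made up here, and library routines such as statistics.quantiles use interpolation rules that may give slightly different values.

```python
from statistics import median


def fourths(values):
    """Lower and upper fourth by the median-of-halves rule (hypothetical helper)."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # size of each half; includes the median when n is odd
    return median(xs[:half]), median(xs[n - half:])


print(fourths(range(1, 13)))  # 12 observations: 1, 2, ..., 12
print(fourths(range(1, 14)))  # 13 observations: 1, 2, ..., 13
```

Using (n + 1) // 2 as the half size handles both cases at once: for even n the halves do not overlap, and for odd n both halves contain the median.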

4 Measures of Variability

4.1 Sample Variance

Now we wish to represent the spread or variability of the data by a number. To do that we use the sample variance,

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1}. \]

The numerator is called the sum of squared deviations,

\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2. \]

Notice that we are dividing by n - 1 instead of n. The sample standard deviation is defined as

\[ s = \sqrt{s^2}. \]

Example

i    x_i    x_i - \bar{x}    (x_i - \bar{x})^2
1     87       -26.25             689.06
2    103       -10.25             105.06
3    130        16.75             280.56
4    160        46.75            2185.56
5    129        15.75             248.06
6    105        -8.25              68.06
7     99       -14.25             203.06
8     93       -20.25             410.06

$\bar{x} = 113.25$, $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 4189.5$

In this case n = 8. Therefore, the sample variance and sample standard deviation are

\[ s^2 = \frac{4189.5}{8-1} = 598.5, \qquad s = \sqrt{598.5} = 24.464. \]
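The table above can be reproduced directly from the definition. The following Python sketch is only a check of the arithmetic, with variable names chosen here to mirror the symbols in the formulas.

```python
from math import sqrt

x = [87, 103, 130, 160, 129, 105, 99, 93]

n = len(x)
xbar = sum(x) / n                         # sample mean
deviations = [xi - xbar for xi in x]      # x_i - xbar
s_xx = sum(d ** 2 for d in deviations)    # sum of squared deviations
s2 = s_xx / (n - 1)                       # sample variance: divide by n - 1, not n
s = sqrt(s2)                              # sample standard deviation

for i, (xi, d) in enumerate(zip(x, deviations), start=1):
    print(f"{i}  {xi:>3}  {d:>7.2f}  {d ** 2:>8.2f}")
print(f"xbar = {xbar}, S_xx = {s_xx}, s^2 = {s2}, s = {s:.3f}")
```

The same values should come out of statistics.variance(x) and statistics.stdev(x), which also divide by n - 1.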

There is another formula for $S_{xx}$ that is easier to compute if you are using a hand-held calculator:

\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}. \]

Example

i    x_i    x_i^2
1     87     7569
2    103    10609
3    130    16900
4    160    25600
5    129    16641
6    105    11025
7     99     9801
8     93     8649

$\sum_{i=1}^{n} x_i = 906$, $\sum_{i=1}^{n} x_i^2 = 106794$

The sum of squared deviations can be calculated as

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 106794 - \frac{(906)^2}{8} = 4189.5. \]

Therefore, the sample variance and sample standard deviation are

\[ s^2 = \frac{4189.5}{8-1} = 598.5, \qquad s = \sqrt{598.5} = 24.464. \]
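To see that the shortcut formula and the definition agree, both can be evaluated on the same data. This Python sketch is an illustration only; the variable names are chosen here.

```python
x = [87, 103, 130, 160, 129, 105, 99, 93]
n = len(x)

sum_x = sum(x)                      # sum of x_i
sum_x2 = sum(xi * xi for xi in x)   # sum of x_i^2

# Shortcut formula: only the two running sums are needed.
s_xx_shortcut = sum_x2 - sum_x ** 2 / n

# Definition: requires the mean first, then a second pass over the data.
xbar = sum_x / n
s_xx_definition = sum((xi - xbar) ** 2 for xi in x)

print(sum_x, sum_x2)
print(s_xx_shortcut, s_xx_definition)
print("s^2 =", s_xx_shortcut / (n - 1))
```

The shortcut is convenient on a calculator because it needs a single pass through the data. In floating-point arithmetic it can lose accuracy when the mean is large compared to the spread, so software usually prefers the definition.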

4.2 Five number summary and Boxplots

A boxplot is another way to summarize data pictorially. A boxplot is drawn using the five number summary, which consists of the minimum observation, lower fourth, median, upper fourth, and maximum observation.

Example (Problem 1-54 on p.40) Shear strength (MPa) of a joint

4.4 16.4 22.2 30.0 33.1 36.6 40.4 66.7 73.7 81.5 109.9

There are 11 observations. Minimum is 4.4. Maximum is 109.9. Median is 36.6.

The lower fourth is the median of the smaller half, {4.4 16.4 22.2 30.0 33.1 36.6}, so it is (22.2 + 30.0)/2 = 26.1.

The upper fourth is the median of the larger half, {36.6 40.4 66.7 73.7 81.5 109.9}, so it is (66.7 + 73.7)/2 = 70.2.

So our five number summary looks like

Min    lower fourth    median    upper fourth    Max
4.4        26.1         36.6         70.2       109.9

The box width $f_x$ is defined as

\[ f_x = \text{upper fourth} - \text{lower fourth}. \]

Now we can draw our boxplot using those five numbers.
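The five number summary for this example can be computed with a small Python function. This is a sketch that assumes the median-of-halves rule from Section 3.2; the name five_number_summary is made up for this illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, lower fourth, median, upper fourth, max); hypothetical helper."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # each half includes the median when n is odd
    return xs[0], median(xs[:half]), median(xs), median(xs[n - half:]), xs[-1]


shear = [4.4, 16.4, 22.2, 30.0, 33.1, 36.6, 40.4, 66.7, 73.7, 81.5, 109.9]

minimum, lower, med, upper, maximum = five_number_summary(shear)
box_width = upper - lower  # upper fourth minus lower fourth

print(f"{minimum:.1f} {lower:.1f} {med:.1f} {upper:.1f} {maximum:.1f}")
print(f"box width = {box_width:.1f}")
```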

Boxplot with outliers

An observation farther than 1.5 box widths from the closest fourth is an outlier. If it is more than 3 box widths from the nearest fourth, it is called an extreme outlier. Otherwise it is called a mild outlier.

Example (Ex. 1.14 on p.28)

2.0 2.4 2.5 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3 3.4 3.4 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1

We have 24 observations. Mean is 3.7.

Min      lower fourth    median    upper fourth    Max
2.000       2.775         3.350       3.875       10.100

The fourths in this table match the interpolated quartiles that statistical software commonly reports; the median-of-halves rule from Section 3.2 gives 2.75 and 4.05 instead.
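The outlier rule can be applied mechanically once the two fourths are known. The Python sketch below is an illustration only: classify is a hypothetical helper, and it takes the fourths 2.775 and 3.875 from the table in the notes as given rather than recomputing them.

```python
def classify(x, lower_fourth, upper_fourth):
    """Label x as 'extreme', 'mild', or 'none' using the 1.5 and 3 box width rule."""
    f = upper_fourth - lower_fourth  # box width
    # Distance from x to the nearest fourth; 0 if x lies inside the box.
    distance = max(lower_fourth - x, x - upper_fourth, 0.0)
    if distance > 3 * f:
        return "extreme"
    if distance > 1.5 * f:
        return "mild"
    return "none"


obs = [2.0, 2.4, 2.5, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
       3.4, 3.4, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

lower_fourth, upper_fourth = 2.775, 3.875  # values from the table in the notes

labels = {x: classify(x, lower_fourth, upper_fourth) for x in obs}
outliers = [x for x in obs if labels[x] != "none"]
for x in outliers:
    print(x, "is a", labels[x], "outlier")
```

With these fourths the box width is 1.1, so anything above 3.875 + 1.65 = 5.525 is an outlier and anything above 3.875 + 3.3 = 7.175 is extreme. Only 10.1 qualifies, and it is extreme.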