Math 223 Lecture Notes 3/15/04
From The Basic Practice of Statistics, by Moore
Chapter 3 continued
Describing distributions with numbers

Measuring spread of data: Quartiles

Definition 1: The interquartile range (IQR) of a set of measurements is the difference between the upper and lower quartiles, i.e. $\mathrm{IQR} = Q_3 - Q_1$.

As we have seen from box-and-whisker plots, the interquartile range is especially useful when comparing the spreads of two distributions. The IQR can also be used to detect outliers:

Example 1: The 1.5 × IQR criterion. A common criterion for detecting suspected outliers in a data set is as follows: call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile. The data on the volume of acorns (in cubic centimeters) from 39 species of oaks are given in today's Minitab worksheet. Use a stem-and-leaf plot to find the outliers. Then see whether these satisfy the 1.5 × IQR criterion.

Measuring spread of data: variance and standard deviation

Recall that $\bar{x}$ denotes the mean of a set $x_1, \dots, x_n$ of observations.

Definition 2: Deviations. The deviations of the data set $x_1, \dots, x_n$ are the numbers
\[ x_1 - \bar{x},\; x_2 - \bar{x},\; \dots,\; x_n - \bar{x}. \]

Definition 3: Variance. The variance $s^2$ of the data set $x_1, \dots, x_n$ is
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]
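As a rough illustration of the definitions so far, the sketch below computes quartiles, applies the 1.5 × IQR criterion, and computes the variance directly from the deviations. The data set is made up for illustration (it is not the acorn data), the function names are hypothetical, and the quartile rule used here (the median of each half of the sorted data, leaving out the overall median when n is odd) is an assumption. Minitab and other software may use a slightly different quartile rule and give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed convention: when n is odd, the overall median is left out
    of both halves.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return median(lower), median(upper)

def suspected_outliers(values):
    """Observations more than 1.5 x IQR above Q3 or below Q1."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

def sample_variance(values):
    """Variance s^2: sum of squared deviations divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    deviations = [x - x_bar for x in values]
    return sum(d * d for d in deviations) / (n - 1)

# Made-up data, for illustration only.
data = [2, 4, 5, 6, 7, 8, 9, 23]

q1, q3 = quartiles(data)
print("Q1 =", q1, " Q3 =", q3, " IQR =", q3 - q1)
print("suspected outliers:", suspected_outliers(data))
print("s^2 =", sample_variance(data))
```

Note that the deviations always sum to zero, which is why they are squared before being added.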
Definition 4: Standard deviation. The standard deviation $s$ of the data set $x_1, \dots, x_n$ is the nonnegative square root of the variance, i.e.
\[ s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \]

Why we divide by $n - 1$ when computing $s$ and $s^2$

We denote by $\sigma^2$ the variance of measurements for a whole population, while $s^2$ denotes the variance of the measurements from a sample of the population. Suppose that we wanted to estimate the variance $\sigma^2$ of the heights of all the adults in the world. Obviously we can't compute $\sigma^2$ exactly, but we can compute the variance $s^2$ of a random sample of the population. We hope that $s^2$ will be close to $\sigma^2$. In fact, suppose that we select many random samples and compute the variances $s_1^2, s_2^2, \dots$ of the samples using the formula in Definition 3. Then the average of $s_1^2, s_2^2, \dots$ would be close to $\sigma^2$. For this reason, $s^2$ is called an unbiased estimator for $\sigma^2$. On the other hand, suppose that we computed $s^2$ by dividing by $n$ instead of $n - 1$. Then the average of $s_1^2, s_2^2, \dots$ would underestimate $\sigma^2$.

Properties of the standard deviation

- $s$ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
- $s = 0$ only if there is no spread, which happens only when all the observations have the same value. As the observations become more spread out about their mean, $s$ gets larger.
- $s$ has the same units as the original observations. For example, if the data set is weights of people in pounds, then $s$ also has units of pounds. This is one reason to prefer $s$ to $s^2$, which has units of pounds squared.
- $s$ is not resistant (to outliers). Strong outliers or skewness can greatly increase $s$.

Choosing a summary of data

- The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers.
- Use $\bar{x}$ and $s$ only for reasonably symmetric distributions that are free of outliers.
- A graph gives the best overall picture of a distribution.
There are certain features of a distribution, such as gaps, that are not revealed by numerical summaries. Always plot your data.
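The earlier claim about dividing by n − 1 rather than n can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the "population" is simulated as normal with mean 170 and standard deviation 10 (so σ² = 100), the sample size of 5 and the number of trials are arbitrary choices, and all names are hypothetical.

```python
import random

random.seed(0)          # fixed seed so the run is repeatable

SIGMA = 10.0            # assumed population standard deviation
TRUE_VARIANCE = SIGMA ** 2
N = 5                   # size of each random sample
TRIALS = 20000          # number of random samples drawn

def variance_with_divisor(sample, divisor):
    """Sum of squared deviations from the sample mean, over `divisor`."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample) / divisor

sum_n_minus_1 = 0.0
sum_n = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(170.0, SIGMA) for _ in range(N)]
    sum_n_minus_1 += variance_with_divisor(sample, N - 1)
    sum_n += variance_with_divisor(sample, N)

avg_n_minus_1 = sum_n_minus_1 / TRIALS   # average of s^2 (divide by n - 1)
avg_n = sum_n / TRIALS                   # average when dividing by n

print("true variance:            ", TRUE_VARIANCE)
print("average, dividing by n-1: ", round(avg_n_minus_1, 1))
print("average, dividing by n:   ", round(avg_n, 1))
```

With these settings the average of the n − 1 version should land near 100, while the divide-by-n version should land near 80, that is, near (n − 1)/n times σ².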
Example 2: Roger Maris. New York Yankee Roger Maris held the single-season home-run record from 1961 until 1998. Here are Maris's home run counts for his 10 years in the American League (these are also in today's Minitab worksheet):

14, 28, 16, 39, 61, 33, 23, 26, 8, 13.

(a) Make a stem-and-leaf plot of the data. Which is the outlier?
(b) Use Minitab to find $\bar{x}$ and $s$.
(c) Now find $\bar{x}$ and $s$ for the 9 observations that remain when you leave out the outlier. How does the outlier affect the values of $\bar{x}$ and $s$?

Example 3: State SAT scores. Average SAT scores for the states and the District of Columbia are given in today's worksheet. Find the basic statistics for the math and verbal scores separately. Then construct stem-and-leaf plots for the math and verbal scores separately. What important feature of the distributions do the numerical summaries fail to reveal?

The Empirical Rule

Suppose that a data set has a "mound" or "bell-shaped" histogram. This means that the histogram has a single peak, is symmetric, and tapers off gradually in the tails. Let $\bar{x}$ be the mean and $s$ be the standard deviation of the data. Then the Empirical Rule, or 68-95-99.7 rule, says that approximately

- 68% of the data lies between $\bar{x} - s$ and $\bar{x} + s$
- 95% of the data lies between $\bar{x} - 2s$ and $\bar{x} + 2s$
- 99.7% of the data lies between $\bar{x} - 3s$ and $\bar{x} + 3s$
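The Empirical Rule can be checked numerically on simulated bell-shaped data. The sketch below is an illustration only: it assumes normally distributed data generated with Python's random module (10,000 values, an arbitrary choice), and the names are hypothetical. It counts the fraction of observations within 1, 2, and 3 standard deviations of the mean.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Simulated bell-shaped data (an assumption, not data from the worksheet).
data = [random.gauss(0.0, 1.0) for _ in range(10000)]

x_bar = mean(data)
s = stdev(data)   # sample standard deviation (divides by n - 1)

def fraction_within(k):
    """Fraction of observations between x_bar - k*s and x_bar + k*s."""
    low, high = x_bar - k * s, x_bar + k * s
    return sum(low <= x <= high for x in data) / len(data)

fractions = {k: fraction_within(k) for k in (1, 2, 3)}

for k, frac in fractions.items():
    print(f"within {k} standard deviation(s): {100 * frac:.1f}%")
```

The percentages should come out close to 68, 95, and 99.7. For skewed data or data with strong outliers the rule should not be expected to hold.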
Example 4: A histogram of the heights of 1000 women aged 18 to 24 was found to have a bell shape. The mean and standard deviation of the heights are 64.5 inches and 2.5 inches, respectively.

(a) About how many of the women are taller than 66 inches?
(b) About how many of the women are taller than 59.5 inches but shorter than 66 inches?

Summarizing Data from More Than One Variable

Contingency table

Also called an $r \times c$ contingency table, where $r$ = number of rows and $c$ = number of columns. Used to summarize data from two qualitative (i.e. categorical) variables.
Example 5: A company operates four machines three shifts each day. From production records, the following data on the number of breakdowns are collected. This is a 3 × 4 contingency table.

Number of breakdowns

               Machines
Shift      A      B      C      D
1         41     20     12     16
2         31     11      9     14
3         15     17     16     10

Stacked bar graph

Example 6: Refer to the preceding table. For each machine separately, we want to display the percentages of breakdowns of that machine that occurred in shifts 1, 2, and 3. To do this we can use a stacked bar graph. First, the tables below are computed. In the second table, each column contains the percentages of breakdowns that occurred in shifts 1, 2, and 3 for that particular machine.

Number of breakdowns

               Machines
Shift      A      B      C      D
1         41     20     12     16
2         31     11      9     14
3         15     17     16     10
Total     87     48     37     40

Percentages of breakdowns

               Machines
Shift      A      B      C      D
1       47.1   41.7   32.4   40.0
2       35.6   22.9   24.3   35.0
3       17.2   35.4   43.2   25.0
Total   99.9  100.0   99.9  100.0

(Column totals of 99.9 are due to rounding.)

Now, to make the stacked bar graph, place A, B, C, D on the horizontal axis. For each of A, B, C, D, stack three blocks whose heights equal the percentages for shifts 1, 2, and 3.
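The computation in Example 6 can be sketched in a few lines of Python. The counts are the ones from the table above. The variable names, and the idea of recording where each stacked block starts (`bottoms`), are illustrative choices rather than part of the notes.

```python
machines = ["A", "B", "C", "D"]

# Rows of the 3 x 4 contingency table: shift -> breakdowns per machine.
counts = {
    1: [41, 20, 12, 16],
    2: [31, 11, 9, 14],
    3: [15, 17, 16, 10],
}
shifts = sorted(counts)
columns = range(len(machines))

# Column totals: total breakdowns for each machine.
totals = [sum(counts[s][j] for s in shifts) for j in columns]

# Column percentages, rounded to one decimal place.
percents = {
    s: [round(100 * counts[s][j] / totals[j], 1) for j in columns]
    for s in shifts
}

# For a stacked bar, each shift's block starts where the previous one ended.
bottoms = {}
running = [0.0] * len(machines)
for s in shifts:
    bottoms[s] = list(running)
    running = [round(b + p, 1) for b, p in zip(running, percents[s])]

print("Shift  " + "".join(f"{m:>7}" for m in machines))
for s in shifts:
    print(f"{s:<7}" + "".join(f"{p:>7.1f}" for p in percents[s]))
print("Total  " + "".join(f"{t:>7.1f}" for t in running))
```

The lists in `percents` give the block heights and the lists in `bottoms` give the height at which each block begins, which is the information a plotting tool needs to draw the stacked bars.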
Cluster bar graph

A cluster bar graph displays the relationship between a combination of qualitative (categorical) variables and a single quantitative variable. The qualitative variables go on the horizontal axis and the quantitative variable goes on the vertical axis.

Example 7: Majors for men and women. A study of the career plans of women and men was made. One question asked which major the student had chosen. Here are the data:

                  Female   Male
Accounting            68     56
Administration        91     40
Economics              5      6
Finance               61     59

Make a cluster bar graph of the data, where each cluster of bars corresponds to a major. What is another way to make a cluster bar graph of this data?
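To show what "one cluster per major" means, here is a minimal text-only sketch using the counts from Example 7. It is not a substitute for a real graph: the bars are drawn sideways with `#` characters, one `#` per two students (an arbitrary scale), and the function name is hypothetical.

```python
majors = ["Accounting", "Administration", "Economics", "Finance"]
counts = {
    "Female": [68, 91, 5, 61],
    "Male": [56, 40, 6, 59],
}

def cluster_rows(majors, counts, scale=2):
    """Build text rows: one cluster per major, one bar per group."""
    rows = []
    for i, major in enumerate(majors):
        rows.append(major)                      # cluster label
        for group, values in counts.items():
            bar = "#" * round(values[i] / scale)
            rows.append(f"  {group:<6} {bar} {values[i]}")
    return rows

rows = cluster_rows(majors, counts)
print("\n".join(rows))
```

In a drawn cluster bar graph the same grouping appears as side-by-side vertical bars above each major on the horizontal axis.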
Scatterplots

A scatterplot is used to display the relationship between two quantitative variables.

Definition 5: Explanatory and response variables. Given a pair of related variables, the variable that explains or causes changes in the other variable is called the explanatory variable. The other variable is called the response variable.

Example 8: There is a relationship between the altitude of a city and the air pressure in that city. Which variable is the explanatory variable and which variable is the response variable?

In a scatterplot, we place the explanatory variable on the horizontal axis and the response variable on the vertical axis.

Example 9: Heating a home. For each of 16 months, a household records average natural gas consumption (in hundreds of cubic feet) and the number of degree-days for that month. (One degree-day is accumulated for each degree a day's average temperature falls below 65 °F. An average temperature of 20 °F, for example, corresponds to 45 degree-days.) The data are given in today's Minitab worksheet. Make a scatterplot of the data.

Examining a scatterplot

Look for the overall pattern and for striking deviations from the pattern. You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.

Positive association and negative association

Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together. Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.

Example 10: Thoroughly describe the scatterplot from Example 9.
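The degree-day rule from Example 9 and the idea of positive versus negative association can be sketched as follows. This is an illustration under stated assumptions: the temperatures and gas figures are made up (they are not the worksheet data), the function names are hypothetical, and the direction is judged by the sign of the sum of products of deviations from the means, which is one simple numerical summary and not a method prescribed in these notes.

```python
def degree_days(avg_temp_f, base=65.0):
    """Degree-days for one day: degrees below 65 F.

    Assumption: a day at or above the base temperature contributes zero.
    """
    return max(0.0, base - avg_temp_f)

def association(xs, ys):
    """Direction of association from products of deviations.

    A product is positive when x and y are both above (or both below)
    their averages, and negative when one is above and the other below.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "none"

# Made-up daily values, for illustration only.
temps = [20, 35, 50, 60]          # average temperature, degrees F
gas = [8.1, 5.9, 3.2, 1.6]        # gas use, hundreds of cubic feet

dd = [degree_days(t) for t in temps]

print("degree-days:", dd)
print("degree-days vs gas: ", association(dd, gas))
print("temperature vs gas: ", association(temps, gas))
```

A numerical check like this only addresses direction. Form, strength, and outliers still have to be read from the scatterplot itself.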