1 Measures of the Center of a Distribution

Oliver Austin
5 years ago

Qualitative descriptions of the shape of a distribution are important and useful, but we will often want the precision of numerical summaries as well. Two aspects of unimodal distributions that we will often want to measure are the center (what is a typical value? where do the values cluster?) and the amount of variation (are the data tightly clustered around a central value, or more spread out?).

Two widely used measures of center are the mean and the median. You are probably already familiar with both. The mean is calculated by adding all the values of a variable and dividing by the number of values. Our usual notation will be to denote the n values as $x_1, x_2, \ldots, x_n$ and the mean of these values as $\bar{x}$. Then the formula for the mean becomes

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

The median is a value that splits the data in half: half of the values are smaller than the median and half are larger. By this definition, there could be more than one median (when there is an even number of values). This ambiguity is removed by taking the mean of the two middle numbers (after sorting the data). Whereas $\bar{x}$ denotes the mean of the n numbers $x_1, \ldots, x_n$, we use $\tilde{x}$ to denote the median of these numbers.

The mean and median are easily computed in R.

> x=c(1,1,5,20,10)
> mean(x)
[1] 7.4
> median(x)
[1] 5

1.1 Comparing mean and median

While both the mean and median provide a measure of the center of a distribution, they measure different things, and sometimes one measure is better than the other. If a distribution is (approximately) symmetric, the mean and median will be (approximately) the same. If the distribution is not symmetric, however, the mean and median may be very different. For example, if we begin with a symmetric distribution and add one additional value that is very much larger than the other values (an outlier), then the median will not change very much (if at all), but the mean will increase substantially. Because of this, we say that the median is resistant to outliers while the mean is not.

A similar thing happens with a skewed, unimodal distribution. If a distribution is positively skewed, the large values in the tail of the distribution increase the mean (as compared to a symmetric distribution) but not the median, so the mean will be larger than the median. Similarly, the mean of a negatively skewed distribution will tend to be smaller than the median.

Consider the data on the populations of the 3,141 county equivalents in the United States. R shows a great difference between the mean county population and the median county population. Note that the largest county, Los Angeles County with over 9 million people, alone contributes over 3,000 people to the mean.

Whether a resistant measure is desirable or not depends on context. If we are looking at the income of employees of a local business, the median may give us a much better indication of what a typical worker earns, since there may be a few large salaries (the business owner's, for example) that inflate the mean. This is also why the government reports median household income and median housing costs. The median county population perhaps tells us more about what a typical county looks like than does the mean.

On the other hand, if we are ultimately interested in the total, the mean is more useful. For example, the median of daily sales of a Hallmark Card store will likely be smaller than the mean, as there are several days of card sales that are outliers (e.g., the few days before Mother's Day). The mean daily sales for a year also allows us to compute the total sales for the year, and comparing the means of two different stores allows us to determine which store has greater sales overall.
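The resistance of the median can be seen on the small vector used earlier. The following is a minimal sketch: the added value 1000 is an arbitrary, hypothetical outlier chosen only for illustration, and the names x and y are not from the notes.

```r
# Small data set from the earlier R example
x <- c(1, 1, 5, 20, 10)

# Append one hypothetical outlier (1000 is an arbitrary illustrative value)
y <- c(x, 1000)

# Before the outlier
mean(x)
median(x)

# After the outlier: the mean jumps, the median barely moves
mean(y)
median(y)

# The mean recovers the total: n times the mean equals the sum
length(x) * mean(x)
sum(x)
```

The last two lines illustrate why the mean is the more useful summary when the total is what matters: the total can be rebuilt from the mean and the number of values, but not from the median.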

1.2 The trimmed mean

There is another measure of center that is less well known and represents a kind of compromise between the mean and the median. In particular, it is more sensitive to the extreme values of a distribution than the median is, but less sensitive than the mean. The idea of a trimmed mean is very simple. Before calculating the mean, we remove the largest and smallest values from the data. The percentage of the data removed from each end is called the trimming percentage. The 10% trimmed mean is the mean of the middle 80% of the data (after removing the largest 10% and the smallest 10%). A trimmed mean is calculated in R by setting the trim argument of mean(), e.g. mean(x,trim=.10).

Although a trimmed mean in some sense combines the advantages of both the mean and median, it is less common than either the mean or the median. This is partly due to the mathematical theory that has been developed for working with the median and especially the mean of sample data.

The 10% trimmed mean of county populations is 38,234, which is much closer in size to the median than to the mean.

We note in passing that there are some complications in defining the trimmed mean. For instance, in the example above, the 10% trimmed mean of the county populations should be computed by removing the 314.1 most populous and the 314.1 least populous counties from the data (10% of 3,141 is 314.1). But removing a fraction of a county doesn't make any sense, so there needs to be some convention for dealing with these fractions of data points. In fact there are several different conventions, but we will let R handle the details of the computation and not be too concerned about the differences. With large datasets, the differences between the various definitions of, say, the 10% trimmed mean are small.

In some sports, the trimmed mean is used to compute a competitor's score based on the scores given by individual judges.
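As a sketch of how such a judged score might be computed with the trim argument, consider five made-up scores. The values and the name scores are hypothetical and do not come from any real competition. Base R's mean() drops floor(n × trim) values from each end, which is one of the conventions for fractional counts mentioned above.

```r
# Hypothetical scores from five judges (illustrative values only)
scores <- c(7.5, 8.0, 8.5, 8.5, 9.5)

# 20% trimmed mean: floor(5 * 0.2) = 1 score is dropped from each end
mean(scores, trim = 0.2)

# The same computation by hand: sort, then average the middle three scores
mean(sort(scores)[2:4])

# With trim = 0.1, 10% of 5 is half a score; R rounds this down to 0,
# so nothing is removed and the result equals the ordinary mean
mean(scores, trim = 0.1)
mean(scores)
```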
Diving, figure skating, and gymnastics are three sports that use a trimmed mean to compute a competitor's final score. (Diving uses the middle three scores when there are five judges, which amounts to computing the 20% trimmed mean.)

2 Measures of Dispersion

It is often useful to characterize a distribution in terms of its center, but that is not the whole story. Consider the distributions depicted in the histograms below.

[Figure: density histograms of two distributions, A and B]

In each case the mean and median are approximately 10, but the distributions clearly have very different shapes. The difference is that distribution B is much more spread out. Almost all of the data in distribution A are quite close to 10; a much larger proportion of distribution B is far away from 10.

The intuitive (and not very precise) statement in the preceding sentence can be quantified by means of quantiles. The idea of quantiles is probably familiar to you, since percentiles are a special case of quantiles.

Definition 2.1 (Quantile). Let $p \in [0, 1]$. A p-quantile of a quantitative distribution is a number q such that the (approximate) proportion of the distribution that is less than q is p.
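Definition 2.1 can be checked numerically. The sketch below uses made-up data (the integers 1 to 100 and 1 to 15, chosen only for illustration) and R's default quantile() method; it compares the requested p with the proportion of values that actually fall below the computed quantile.

```r
# Illustrative data: the integers 1 to 100
x <- 1:100
q20 <- quantile(x, 0.20)   # default quantile method
q20
mean(x < q20)              # proportion of values below the .2-quantile

# With only 15 values the proportion can only be approximately p
y <- 1:15
q30 <- quantile(y, 0.30)
q30
mean(y < q30)              # 5 of 15 values lie below, not exactly 30%
```

In the second case no value can have exactly 30% of 15 data points below it, which is why the definition includes the word "approximate".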

So for example, the .2-quantile divides a distribution into 20% below and 80% above. The .2-quantile is also called the 20th percentile. The median is just the .5-quantile (and the 50th percentile).

While the definition of quantile above seems clear, it has the same complication as the definition of the trimmed mean. Suppose your data set has 15 values. What is the .30-quantile? 30% of the data would be (.30)(15) = 4.5 values. Of course, there is no number that has 4.5 values below it and 10.5 values above it. This is the reason for the parenthetical word "approximate" in Definition 2.1. Different methods have been proposed for giving quantiles a single value, and R implements 9 different methods! The next example illustrates the default computation of the .25-quantile for datasets of size 5, 6, 7 and 8.

> quantile(1:5,.25)
25%
  2
> quantile(1:6,.25)
 25%
2.25
> quantile(1:7,.25)
25%
2.5
> quantile(1:8,.25)
 25%
2.75

Fortunately, for large data sets, the differences between the various quantile methods are usually small, so we will just let R compute quantiles for us using the default method of the quantile() function.

The difference between the first and third quartiles (the .25-quantile and the .75-quantile) is often used as a simple measure of dispersion. This measure is called the inter-quartile range and abbreviated IQR. The IQR of the Old Faithful eruption times is the difference between the following two quantiles, and IQR() computes it directly.

> quantile(faithful$eruptions,.75)
> quantile(faithful$eruptions,.25)
> IQR(faithful$eruptions)

Note that since the IQR depends only on the middle 50% of the data, it is a measure of dispersion that is resistant to outliers.

Especially for hand computation, yet another method of computing the quartiles, based on the median, is popular. With this method, the first and third quartiles (the .25-quantile and the .75-quantile respectively) are called the lower hinge and the upper hinge respectively. These are computed by the following definition.

Definition 2.2 (hinges). Suppose that a variable $x_1, \ldots, x_n$ has an even number of values, say n = 2k. Then the lower hinge is the median of the smallest k values and the upper hinge is the median of the largest k values. If the variable has an odd number of values, n = 2k + 1, then the lower hinge is the median of the smallest k + 1 values and the upper hinge is the median of the largest k + 1 values.

In other words, the lower hinge is the median of the lower half of the data, with the middle point included in that half if there is an odd number of data points. The upper hinge is defined similarly.

A very common and useful description of the variability in a distribution is the five-number summary. The five-number summary consists of the minimum, lower hinge, median, upper hinge, and maximum of the distribution. The five-number summary is computed by the R function fivenum().
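Definition 2.2 can be compared against fivenum() on small made-up data sets. This is only a sketch: the vectors 1:6 and 1:7 and the names x_even and x_odd are illustrative choices, picked so that the even and odd cases can both be checked by hand.

```r
# Even number of values: n = 6, so k = 3
x_even <- 1:6
fivenum(x_even)                 # min, lower hinge, median, upper hinge, max
median(x_even[1:3])             # lower hinge: median of the smallest 3 values
median(x_even[4:6])             # upper hinge: median of the largest 3 values

# The hinges need not agree with the default quantile() quartiles
quantile(x_even, c(0.25, 0.75))

# Odd number of values: n = 7, so k = 3 and each half has k + 1 = 4 values
x_odd <- 1:7
fivenum(x_odd)
median(x_odd[1:4])              # lower hinge includes the middle point
median(x_odd[4:7])              # upper hinge includes the middle point
```

For 1:6 the lower hinge is 2 while the default .25-quantile is 2.25, a small instance of the differences between quartile conventions noted above.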

The five-number summary is often presented graphically by means of a boxplot. The standard R function is boxplot(). A boxplot of the Sepal.Width variable of the iris data is generated by

> boxplot(iris$Sepal.Width,horizontal=TRUE)

The sides of the box are drawn at the hinges. The median is represented by a dot or line in the box. In some boxplots, the whiskers extend out to the maximum and minimum values. However, the boxplot that we are using here attempts to identify outliers. Outliers are values that are unusually large or small and are indicated by a special symbol beyond the whiskers. The whiskers are then drawn from the box to the largest and smallest non-outliers. One common rule for automating outlier detection for boxplots is the 1.5 × IQR rule. This is the default rule in both boxplot functions in R. Under this rule, any value that is more than 1.5 × IQR away from the box is marked as an outlier. Indicating outliers in this way is useful since it allows us to see if the whisker is long only because of a few extreme values.

A boxplot gives us some idea of the symmetry and general dispersion of a variable, but it certainly doesn't give us as much information about the shape of a distribution as a histogram.

2.1 Variance and Standard Deviation

Another important way to measure the dispersion of a distribution is by comparing each value to the center of the distribution. If the distribution is spread out, these differences will tend to be large; otherwise these differences will be small. To get a single number, we could simply add up all of the deviations from the mean:

$$\text{total deviation from the mean} = \sum_{i=1}^{n} (x_i - \bar{x}).$$

The trouble with this is that the total deviation from the mean is always 0. The problem is that the negative deviations and the positive deviations always exactly cancel out. To fix this problem we might consider taking the absolute value of the deviations from the mean:

$$\text{total absolute deviation from the mean} = \sum_{i=1}^{n} |x_i - \bar{x}|.$$

This number will only be 0 if all of the data values are equal to the mean. Even better would be to divide by the number of data values. Otherwise large data sets will have large sums even if the values are all close to the mean.

$$\text{mean absolute deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|.$$

This is a reasonable measure of the dispersion in a distribution, but we will not use it very often. There is another measure that is much more common, namely the variance, which is defined by

$$\text{variance} = \mathrm{VAR}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
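The cancellation of the deviations, and the two absolute-deviation summaries, can be checked on the small vector from Section 1. This is a sketch using only base arithmetic; the name d is introduced here for illustration. Note that R's built-in mad() computes a (scaled) median absolute deviation, which is a different quantity, so it is not used here.

```r
# Small data set from the earlier R example; its mean is 7.4
x <- c(1, 1, 5, 20, 10)

# Deviations from the mean
d <- x - mean(x)
d

# Total deviation: negative and positive deviations cancel
# (the result is 0 up to floating-point rounding)
sum(d)

# Total absolute deviation and mean absolute deviation
sum(abs(d))
mean(abs(d))
```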

You will notice two differences from the mean absolute deviation. First, instead of using an absolute value to make things positive, we square the deviations from the mean. One advantage of squaring over the absolute value is that it is much easier to do calculus with a polynomial than with functions involving absolute values. The second difference is that we divide by n − 1 instead of by n. There is a good reason for this, even though dividing by n seems more natural. We will get to that reason later. However, the following principle helps to remember the number n − 1. If we consider the n deviations $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$, any one of them is determined by the other n − 1 (because the sum of these deviations is 0). Therefore there are only n − 1 degrees of freedom in these numbers rather than the n degrees of freedom in the data. Indeed, n − 1 is called the degrees of freedom for this reason.

Because the squaring changes the units of this measure, the square root of the variance, called the standard deviation, is commonly used in place of the variance.

$$\text{standard deviation} = \mathrm{SD}(x) = \sqrt{\mathrm{VAR}(x)}.$$

We will sometimes use the notation $s_x$ and $s_x^2$ for the standard deviation and variance respectively. The subscript x refers to the particular variable for which we are computing the variance or standard deviation, and we sometimes omit it (and write $s$ or $s^2$) when it is clear what variable is involved.

All of these quantities are easy to compute in R.
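As a minimal sketch, the base R functions var() and sd() can be compared with the formulas above on the small vector from Section 1. The names n and d are introduced here for illustration only.

```r
# Small data set from the earlier R example
x <- c(1, 1, 5, 20, 10)
n <- length(x)

# Variance: built-in function versus the formula with divisor n - 1
var(x)
sum((x - mean(x))^2) / (n - 1)

# Standard deviation: built-in function versus the square root of the variance
sd(x)
sqrt(var(x))

# Degrees of freedom: the last deviation is determined by the other n - 1,
# because all n deviations must sum to 0
d <- x - mean(x)
d[n]
-sum(d[-n])
```

Dividing by n instead of n − 1 would give a smaller number than var(x) reports, so it is worth knowing which divisor a given function uses.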
