Basic Business Statistics: Concepts and Applications
Sixth Edition
Mark L. Berenson
David M. Levine
Department of Statistics and Computer Information Systems, Baruch College, City University of New York
Prentice Hall, Upper Saddle River, New Jersey
LIBRARY OF CONGRESS CATALOGING-IN-PUBLICATION DATA
Berenson, Mark L.
Basic business statistics : concepts and applications / Mark L. Berenson, David M. Levine. 6th ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-303009-1
1. Commercial statistics. 2. Statistics. I. Levine, David M.
HF1017.B38 1996 519.5-dc20 94-12551 CIP

Acquisitions Editor: Tom Tucker
Production Editor: Katherine Evancie
Managing Editor: Joyce Turner
Cover Designer: Sue Behnke
Interior Design: Ed Smith
Design Director: Patricia H. Wosczyk
Buyer: Marie McNamara
Assistant Editor: Diane Peirano
Production Assistant: Renee Pelletier
Marketing Manager: Susan McLaughlin
Cover art: Marjory Dressler

© 1996, 1992, 1989, 1986, 1983, 1979 by Prentice Hall, Inc.
Simon & Schuster/A Viacom Company
Upper Saddle River, New Jersey 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America
10 9 8 7 6 5 4
ISBN 0-13-303009-1

Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Simon & Schuster Asia Pte. Ltd., Singapore
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
Using the ordered array of tuition rates charged (in thousands of dollars) to out-of-state residents from our sample of six Pennsylvania schools:

4.9 6.3 7.7 8.9 10.3 11.7

the range is 11.7 - 4.9 = 6.80 thousand dollars.

The range measures the total spread in the batch of data. Although the range is a simple, easily calculated measure of total variation in the data, its distinct weakness is that it fails to take into account how the data are actually distributed between the smallest and largest values. This can be observed from Figure 4.4. Thus, as evidenced in scale C, it would be improper to use the range as a measure of variation when either one or both of its components are extreme observations.

[Figure 4.4 Comparing three data sets with the same range. Panels: Scale A, Scale B, Scale C.]

The Interquartile Range

The interquartile range (also called midspread) is the difference between the third and first quartiles in a batch of data. That is,

$$\text{Interquartile range} = Q_3 - Q_1 \qquad (4.5)$$

This simple measure considers the spread in the middle 50% of the data and thus is in no way influenced by possibly occurring extreme values.

Measures of Variation 119
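As a rough check on the two measures just defined, the following Python sketch computes the range and the interquartile range for the six tuition rates. The quartile positioning rule it uses (positions (n + 1)/4 and 3(n + 1)/4, with a position exactly halfway between two integers averaged and any other fractional position rounded to the nearest integer) is an assumption on our part and is not stated in this passage. The helper name `ordered_value` is hypothetical.

```python
import math

# Out-of-state tuition rates (thousands of dollars), six Pennsylvania schools.
tuition = [4.9, 6.3, 7.7, 8.9, 10.3, 11.7]


def ordered_value(sorted_data, position):
    """Return the value at a 1-based position in an ordered array.

    Assumed rule (not taken from this passage): a position exactly halfway
    between two integers averages the two neighbouring values; any other
    fractional position is rounded to the nearest integer.
    """
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0.5:
        return (sorted_data[lower - 1] + sorted_data[lower]) / 2
    index = lower if fraction < 0.5 else lower + 1
    return sorted_data[index - 1]


data = sorted(tuition)
n = len(data)

data_range = data[-1] - data[0]             # largest minus smallest value
q1 = ordered_value(data, (n + 1) / 4)       # assumed first-quartile position
q3 = ordered_value(data, 3 * (n + 1) / 4)   # assumed third-quartile position
iqr = q3 - q1                               # Equation (4.5)

print(f"range = {data_range:.2f} thousand dollars")
print(f"interquartile range = {iqr:.2f} thousand dollars")
```

Under this assumed rule the sketch picks the second and fifth ordered values as the quartiles. Statistical software often interpolates between neighbouring values instead, so quartiles reported by other tools may differ somewhat for small samples like this one.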
120 Chapter 4 Summarizing and Describing Numerical Data

For the Pennsylvania tuition rate data we have

Interquartile range = Q3 - Q1 = 10.3 - 6.3 = 4.0 thousand dollars

This is the range in tuition rates for the middle group of Pennsylvania schools.

4.5.3 The Variance and the Standard Deviation

Although the range is a measure of the total spread and the interquartile range is a measure of the middle spread, neither of these measures of variation takes into consideration how the observations distribute or cluster. Two commonly used measures of variation that do take into account how all the values in the data are distributed are the variance and its square root, the standard deviation. These measures evaluate how the values fluctuate about the mean.

Defining the Sample Variance

The sample variance is roughly (or almost) the average of the squared differences between each of the observations in a batch of data and the mean. Thus, for a sample containing n observations, $X_1, X_2, \ldots, X_n$, the sample variance (given by the symbol $S^2$) can be written as

$$S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$$

Using our summation notation, the above formulation can be more simply expressed as

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \qquad (4.6)$$

where
$\bar{X}$ = sample arithmetic mean
$n$ = sample size
$X_i$ = ith value of the random variable X
$\sum_{i=1}^{n} (X_i - \bar{X})^2$ = summation of all the squared differences between the $X_i$ values and $\bar{X}$

Had the denominator been n instead of n - 1, the average of the squared differences around the mean would have been obtained. However, n - 1 is used here because of certain desirable mathematical properties possessed by the statistic $S^2$ that make it appropriate for statistical inference (see Chapter 9). If the sample size is large, division by n or n - 1 doesn't really make much difference.

Defining the Sample Standard Deviation

The sample standard deviation (given by the symbol S) is simply the square root of the sample variance. That is,
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad (4.7)$$

Computing $S^2$ and S

To compute the variance we
1. Obtain the difference between each observation and the mean
2. Square each difference
3. Add the squared results together
4. Divide the summation by n - 1

To compute the standard deviation, we merely take the square root of the variance.

For our sample of six Pennsylvania schools, the raw (tuition rate) data (in thousands of dollars) were

10.3 4.9 8.9 11.7 6.3 7.7

and $\bar{X}$ = 8.30 thousand dollars. The sample variance is computed as

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(10.3 - 8.3)^2 + (4.9 - 8.3)^2 + \cdots + (7.7 - 8.3)^2}{6 - 1} = \frac{31.84}{5} = 6.368 \text{ (in squared thousands of dollars)}$$

and the sample standard deviation is computed as

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{6.368} = 2.52 \text{ thousand dollars}$$

Obtaining $S^2$ and S

Since in the preceding computations we are squaring the differences, neither the variance nor the standard deviation can ever be negative. The only time $S^2$ and S could be zero would be when there was no variation at all in the data, that is, when each observation in the sample was exactly the same. In such an unusual case the range would also be zero. But numerical data are inherently variable, not constant. Any random phenomenon of interest that we could think of usually takes on a variety of values. For example, colleges and universities charge different rates of tuition for out-of-state residents, just as people have different IQs, incomes, weights, heights, ages, pulse rates, etc. It is because numerical data inherently vary that it becomes so important to study not only measures (of central tendency) that summarize the data but also measures (of variation) that reflect how the numerical data are dispersed.
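The four computing steps listed above for the variance, followed by the square root for the standard deviation, can be sketched in Python as follows. This is only an illustration of Equations (4.6) and (4.7) applied to the six tuition rates; the variable names are our own, and the cross-check against Python's standard `statistics` module is an addition rather than part of the text.

```python
import math
import statistics

# Raw tuition rates (thousands of dollars), six Pennsylvania schools.
tuition = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]

n = len(tuition)
mean = sum(tuition) / n

deviations = [x - mean for x in tuition]        # step 1: observation minus mean
squared = [d ** 2 for d in deviations]          # step 2: square each difference
sum_of_squares = sum(squared)                   # step 3: add the squared results
sample_variance = sum_of_squares / (n - 1)      # step 4: divide by n - 1
sample_sd = math.sqrt(sample_variance)          # square root of the variance

print(f"mean = {mean:.2f} thousand dollars")
print(f"S^2  = {sample_variance:.3f} squared thousands of dollars")
print(f"S    = {sample_sd:.2f} thousand dollars")

# Cross-check: statistics.variance and statistics.stdev also divide by n - 1.
assert math.isclose(sample_variance, statistics.variance(tuition))
assert math.isclose(sample_sd, statistics.stdev(tuition))
```

The intermediate lists are kept separate so that each line maps onto one of the numbered steps; in practice the two `statistics` calls alone would be enough.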
What the Variance and the Standard Deviation Indicate

The variance and the standard deviation measure the "average" scatter around the mean; that is, how larger observations fluctuate above it and how smaller observations distribute below it.

The variance possesses certain useful mathematical properties. However, its computation results in squared units (squared thousands of dollars, squared dollars, squared inches, etc.). Thus, for practical work our primary measure of variation will be the standard deviation, whose value is in the original units of the data (thousands of dollars, dollars, inches, etc.).

In the Pennsylvania tuition rate sample the standard deviation is 2.52 thousand dollars. This tells us that the majority of the tuition rates in this sample are clustering within 2.52 thousand dollars around the mean of 8.30 thousand dollars (that is, between 5.78 and 10.82 thousand dollars).

Why We Square the Deviations

The formulas for variance and standard deviation could not merely use

$$\sum_{i=1}^{n} (X_i - \bar{X})$$

as a numerator, because, as you may recall, the mean acts as a balancing point for observations larger and smaller than it. Therefore, the sum of the deviations about the mean is always zero; that is,

$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0$$

To demonstrate this, let us again refer to the Pennsylvania tuition rate data:

10.3 4.9 8.9 11.7 6.3 7.7

Therefore,

$$\sum_{i=1}^{n} (X_i - \bar{X}) = (10.3 - 8.3) + (4.9 - 8.3) + (8.9 - 8.3) + (11.7 - 8.3) + (6.3 - 8.3) + (7.7 - 8.3) = 0$$

This is depicted in the accompanying dot scale diagram displayed in Figure 4.5.

[Figure 4.5 The mean as a balancing point. Axis: Tuition rates at six Pennsylvania schools.]

As already noted, three of the observations are smaller than the mean and
three are larger. Although the sum of the six deviations (2.0, -3.4, 0.6, 3.4, -2.0, and -0.6) is zero, the sum of the squared deviations allows us to study the variation in the data. Hence we use

$$\sum_{i=1}^{n} (X_i - \bar{X})^2$$

when computing the variance and standard deviation. In the squaring process, observations that are farther from the mean get more weight than observations closer to the mean.

The respective squared deviations for the Pennsylvania tuition rate data are

4.00 11.56 0.36 11.56 4.00 0.36

We note that the fourth observation ($X_4$ = 11.7 thousand dollars) is 3.4 thousand dollars higher than the mean, and the second observation ($X_2$ = 4.9 thousand dollars) is 3.4 thousand dollars lower. In the squaring process both these values contribute substantially more to the calculation of $S^2$ and S than do the other observations in the sample, which are closer to the mean.

Therefore we may generalize as follows:
1. The more spread out or dispersed the data are, the larger will be the range, the interquartile range, the variance, and the standard deviation.
2. The more concentrated or homogeneous the data are, the smaller will be the range, the interquartile range, the variance, and the standard deviation.
3. If the observations are all the same (so that there is no variation in the data), the range, interquartile range, variance, and standard deviation will all be zero.

Computing $S^2$ and S: "Hand-Held Calculator" Formulas

The formulas for variance and standard deviation, Equations (4.6) and (4.7), are definitional formulas, but they are often not practical to use even with a hand-held calculator. For our Pennsylvania tuition rate data the mean, 8.30 thousand dollars, is not an integer. For these more typical situations, where the observations and the mean are unlikely to be integers, the following "hand-held calculator" formulas for the variance and the standard deviation are given for practical use:

$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} \qquad (4.8)$$

$$S = \sqrt{\frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}} \qquad (4.9)$$
where
$\sum_{i=1}^{n} X_i^2$ = summation of the squares of the individual observations
$n\bar{X}^2$ = sample size times the square of the sample mean

The hand-held calculator formulas, Equations (4.8) and (4.9), are algebraically identical to the definitional formulas, Equations (4.6) and (4.7). Since the denominators are the same, it is easy to show through expansion and the use of summation rules (see Appendix B) that

$$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2$$

Moreover, since $S^2$ (and S) can never be negative, the summation of squares, $\sum_{i=1}^{n} X_i^2$, must always equal or exceed $n\bar{X}^2$, the sample size times the square of the sample mean.

Returning to the Pennsylvania tuition rate data, the variance and standard deviation are recomputed using Equations (4.8) and (4.9) as follows:

$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1} = \frac{(10.3^2 + 4.9^2 + \cdots + 7.7^2) - 6(8.3^2)}{6 - 1} = \frac{(106.09 + 24.01 + \cdots + 59.29) - 6(68.89)}{5} = \frac{445.18 - 413.34}{5} = \frac{31.84}{5} = 6.368 \text{ (in squared thousands of dollars)}$$

and

$$S = \sqrt{6.368} = 2.52 \text{ thousand dollars}$$

4.5.4 The Coefficient of Variation

Unlike the previous measures we have studied, the coefficient of variation is a relative measure of variation. It is expressed as a percentage rather than in terms of the units of the particular data.
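Looking back at the recomputation with Equations (4.8) and (4.9) on this page, the following Python sketch checks numerically that the "hand-held calculator" numerator agrees with the definitional numerator for the six tuition rates. It is an illustration under our own variable names, not a substitute for the algebraic argument referred to in Appendix B.

```python
import math

# Raw tuition rates (thousands of dollars), six Pennsylvania schools.
tuition = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]

n = len(tuition)
mean = sum(tuition) / n

# Numerator of Equation (4.8): sum of squared observations minus n times mean squared.
sum_sq_obs = sum(x ** 2 for x in tuition)
shortcut_numerator = sum_sq_obs - n * mean ** 2

# Numerator of Equation (4.6): sum of squared deviations about the mean.
definitional_numerator = sum((x - mean) ** 2 for x in tuition)

shortcut_variance = shortcut_numerator / (n - 1)   # Equation (4.8)
shortcut_sd = math.sqrt(shortcut_variance)         # Equation (4.9)

print(f"sum of squared observations = {sum_sq_obs:.2f}")
print(f"shortcut numerator          = {shortcut_numerator:.2f}")
print(f"definitional numerator      = {definitional_numerator:.2f}")
print(f"S^2 = {shortcut_variance:.3f}, S = {shortcut_sd:.2f}")
```

One caution for computer work rather than calculator work: in floating-point arithmetic the shortcut form subtracts two numbers of similar size, so it can lose precision when the mean is very large relative to the spread of the data. For data like the tuition rates the two forms agree to many decimal places.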