M7 Chapter 3 Section 1

OBJECTIVES

Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
Describe data using measures of variation, such as the range, variance, and standard deviation.
Identify the position of a data value in a data set using various measures of position, such as percentiles, deciles, and quartiles.
Use the techniques of exploratory data analysis, including boxplots and five-number summaries, to discover various aspects of data.

INTRODUCTION

Statistical methods can be used to summarize data. Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange. They measure the center of the distribution or the most typical case.

Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.

Measures of position tell where a specific data value falls within the data set, or its relative position in comparison with other data values. The most common measures of position are percentiles, deciles, and quartiles.

The measures of central tendency, variation, and position are part of what is called traditional statistics. This type of statistics is typically used to confirm conjectures about the data.

Another type of statistics is called exploratory data analysis. Its techniques include the boxplot and the five-number summary. They can be used to explore data to see what they show.

BASIC VOCABULARY

A statistic is a characteristic or measure obtained by using the data values from a sample. Roman letters are used to represent statistics (example: the symbol X̄ represents the sample mean).

A parameter is a characteristic or measure obtained by using all the data values for a specific population. Greek letters are used to represent parameters (example: the Greek letter µ (mu) represents the population mean).

When the data in a data set are ordered, the set is called a data array.
M7 Chapter 3 Section 2

MEASURES OF CENTRAL TENDENCY

Mean
Median
Mode
Midrange

Mean (also known as Arithmetic Average): The mean is the sum of the values divided by the total number of values. The symbol X̄ represents the sample mean:

\[ \bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n} \]

For a population, we use the Greek letter µ to denote the mean.

Rounding Rule for the mean: round to one more decimal place than occurs in the raw data.

Example: Find the mean of the sample: 20, 26, 40, 36, 23, 42, 35, 24, 30.

\[ \bar{X} = \frac{\sum X}{n} = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} = 30.7 \]

Mean for Grouped Data:

\[ \bar{X} = \frac{\sum f \cdot X_m}{n} \]

Example 3-3

A Class        B Frequency (f)   C Midpoint (Xm)   D f · Xm
5.5-10.5       1                 8                 8
10.5-15.5      2                 13                26
15.5-20.5      3                 18                54
20.5-25.5      5                 23                115
25.5-30.5      4                 28                112
30.5-35.5      3                 33                99
35.5-40.5      2                 38                76
               n = 20                              Σ f · Xm = 490

\[ \bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{490}{20} = 24.5 \]

Procedure:
Step 1: Make a table with 4 columns, and populate the first two columns with the class boundaries and the frequency values.
Step 2: Find the midpoint (Xm) of each class and enter it in column C.
Step 3: For each class, multiply the frequency (f) by its midpoint and place the product in column D.
Step 4: Find the sum of column D.
Step 5: Divide this sum by the sum of all the frequencies (n).
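The two mean calculations above can be checked with a short Python sketch. The function names `mean` and `grouped_mean` are illustrative choices, not part of the notes, and the sample values are the assumed reading of the example data (20, 26, 40, 36, 23, 42, 35, 24, 30), which is consistent with the stated sum of 276.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by the number of values."""
    return sum(values) / len(values)


def grouped_mean(frequencies, midpoints):
    """Mean for grouped data: sum of f * Xm divided by n, where n is the sum of f."""
    n = sum(frequencies)
    weighted_total = sum(f * xm for f, xm in zip(frequencies, midpoints))
    return weighted_total / n


# Ungrouped sample (assumed reading of the example data).
sample = [20, 26, 40, 36, 23, 42, 35, 24, 30]
print(round(mean(sample), 1))  # 30.7, one more decimal place than the raw data

# Columns B and C of the Example 3-3 table.
frequencies = [1, 2, 3, 5, 4, 3, 2]
midpoints = [8, 13, 18, 23, 28, 33, 38]
print(grouped_mean(frequencies, midpoints))  # 24.5
```

Rounding is applied only when printing, so the unrounded mean remains available for later calculations such as the variance.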
Median

The median is the halfway point in a data array. The symbol is MD. (A data array implies that the data must be arranged in order.)

Example: Find MD of the values: 713, 300, 618, 595, 311, 401, 292
Step 1: Arrange the data in order: 292, 300, 311, 401, 595, 618, 713
Step 2: Select the middle point: 401
Hence MD = 401.

Example: Find MD of the values: 656, 684, 702, 764, 856, 1132, 1133, 1303
Step 1: Arrange the data in order: 656, 684, 702, 764, 856, 1132, 1133, 1303
Step 2: Select the middle point: it falls between 764 and 856. In this case we calculate the mean of these two values:
MD = average of the two middle values = (764 + 856) / 2 = 810

Mode

The mode is the value that occurs most often in a data set. A data set can have more than one mode, or no mode at all.

Example: 8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11
It is helpful to arrange the data in order: 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 11, 11, 14, 14, 14
Observe that 8 occurs 5 times, more often than any other number. Thus, mode = 8.

Example: 5, 8, 9, 10, 1, 45, 78
Since each value occurs only once, there is no mode.

Example: 15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26
18 and 24 both occur three times, more than any other values. Thus the modes for this data set are 18 and 24.

Grouped Data: The mode for grouped data is the modal class. The modal class is the class with the largest frequency. In Example 3-3 above, the modal class is 20.5-25.5.

The mode is the only measure of central tendency that can be used with categorical distributions.

Midrange

The midrange is a rough estimate of the middle. It is found by adding the lowest and highest values in the data set and dividing by 2. Symbol used: MR

Example: 2, 3, 6, 8, 4, 1
MR = (1 + 8) / 2 = 4.5
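The median, mode, and midrange procedures above can be sketched in Python as follows. The function names are illustrative, and the data lists use the assumed readings of the examples (for instance 292 as the smallest value in the first median example). Returning an empty list when every value occurs once is one possible way to represent "no mode".

```python
from collections import Counter


def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


def modes(values):
    """All values tied for the highest count; empty list if each value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Rough estimate of the middle: (lowest + highest) / 2."""
    return (min(values) + max(values)) / 2


print(median([713, 300, 618, 595, 311, 401, 292]))              # 401
print(median([656, 684, 702, 764, 856, 1132, 1133, 1303]))      # 810.0
print(modes([8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11]))  # [8]
print(modes([15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]))      # [18, 24]
print(midrange([2, 3, 6, 8, 4, 1]))                             # 4.5
```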
The Weighted Mean

To find the weighted mean, multiply each value by its corresponding weight and divide the sum of the products by the sum of the weights:

\[ \bar{X} = \frac{w_1 X_1 + w_2 X_2 + w_3 X_3 + \cdots + w_n X_n}{w_1 + w_2 + w_3 + \cdots + w_n} = \frac{\sum wX}{\sum w} \]

where w_1, w_2, ..., w_n are the weights and X_1, X_2, ..., X_n are the values.

Example: Grade Point Average

Course            Credits (w)   Grade (X)
Engl Comp         3             A (4 points)
Intr Psychology   3             C (2 points)
Bio I             4             B (3 points)
Phys Ed           2             D (1 point)

\[ \text{GPA} = \bar{X} = \frac{\sum wX}{\sum w} = \frac{3 \cdot 4 + 3 \cdot 2 + 4 \cdot 3 + 2 \cdot 1}{3 + 3 + 4 + 2} = \frac{32}{12} = 2.7 \]

Properties and Uses of Central Tendency

The Mean
1. One computes the mean by using all the values of the data.
2. The mean, in most cases, varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
3. The mean is used in computing other statistics, such as the variance.
4. The mean for the data set is unique, and not necessarily one of the data values.
5. The mean cannot be computed for an open-ended frequency distribution.
6. The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.

The Median
1. The median is used when one must find the center or middle value of a data set.
2. The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.
3. The median is used to find the average of an open-ended distribution.
4. The median is affected less than the mean by extremely high or extremely low values.

The Mode
1. The mode is used when the most typical case is desired.
2. The mode is the easiest average to compute.
3. The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.
4. The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.

The Midrange
1. The midrange is easy to compute.
2. The midrange gives the midpoint.
3. The midrange is affected by extremely high or low values in a data set.
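The weighted mean from the grade point average example earlier in this section can be reproduced with a minimal Python sketch. The function name `weighted_mean` is an illustrative choice, and the credit values (3, 3, 4, 2) are the assumed reading of the table, consistent with the result of 2.7.

```python
def weighted_mean(values, weights):
    """Sum of w * X divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grade_points = [4, 2, 3, 1]  # A, C, B, D
credits = [3, 3, 4, 2]       # weights for each course
gpa = weighted_mean(grade_points, credits)
print(round(gpa, 1))  # 2.7
```

With equal weights the function reduces to the ordinary mean, which is a quick way to sanity-check it.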
Distribution Shapes

In a symmetric distribution, the data values are evenly distributed on both sides of the mean.

[Figure: Symmetric distribution, with the mean, median, and mode marked]

In a positively skewed or right-skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution.

[Figure: Right-skewed distribution, with the mean, median, and mode marked]

When the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left, the distribution is said to be negatively skewed or left skewed.

[Figure: Left-skewed distribution, with the mean, median, and mode marked]
M7 Chapter 3 Section 3

MEASURES OF VARIATION

[Figure: Example 3-18, same mean, very different data spread]

In order to measure the spread or variability of a data set, we use three measures:
Range
Variance
Standard Deviation

Range

The range of a data set is the difference between its highest and lowest values:
R = Highest Value - Lowest Value

Variance and Standard Deviation

The measures of variance and standard deviation are used to determine the consistency of a variable.

Rounding Rule: Same rule as for the mean. The final answer should be rounded to one more decimal place than that of the original data.

Variance is the average of the squares of the distance that each value is from the mean. The symbol for the population variance is σ²; the formula is:

\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \]

where
X = individual value
µ = population mean
N = population size

The standard deviation is the square root of the variance. The symbol for the population standard deviation is σ. The corresponding formula is:

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]
Example 3-21: Consider the data set whose values are in column A.

A Values (X)   B X - µ            C (X - µ)²
10             10 - 35 = -25      625
60             60 - 35 = +25      625
50             50 - 35 = +15      225
30             30 - 35 = -5       25
40             40 - 35 = +5       25
20             20 - 35 = -15      225
                                  Σ(X - µ)² = 1750

Step 1: Compute the mean: µ = (10 + 60 + 50 + 30 + 40 + 20) / 6 = 210 / 6 = 35
Step 2: Subtract the mean (35) from each value; the result is in column B.
Step 3: Square each difference; the result is in column C.
Step 4: Sum the squares in column C.
Step 5: Divide this sum by the number of values (6) to get the variance: σ² = 1750 / 6 = 291.7
Step 6: Take the square root of this variance to obtain the standard deviation: σ = √291.7 = 17.1

Formulas for the Sample Variance and Standard Deviation:

Variance of a sample:

\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \]

Standard deviation of a sample:

\[ s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]

where
X = individual value
X̄ = sample mean
n = sample size
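The six steps of the population example, together with the sample formula, can be written as a short Python sketch. The function names are illustrative; the only difference between the two functions is the divisor (N for a population, n - 1 for a sample).

```python
import math


def population_variance(values):
    """Average squared distance from the population mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Same sum of squared deviations, divided by n - 1 for a sample."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [10, 60, 50, 30, 40, 20]
variance = population_variance(data)
print(round(variance, 1))             # 291.7
print(round(math.sqrt(variance), 1))  # 17.1

# Treating the same six values as a sample gives a larger variance: 1750 / 5.
print(sample_variance(data))          # 350.0
```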
Computational Formulas for the sample variance and sample standard deviation:

Sample variance:

\[ s^2 = \frac{\sum X^2 - \left[ (\sum X)^2 / n \right]}{n - 1} \]

Sample standard deviation:

\[ s = \sqrt{\frac{\sum X^2 - \left[ (\sum X)^2 / n \right]}{n - 1}} \]

Example: 11.2, 11.9, 12.0, 12.8, 13.4, 14.3

Step 1: Find the sum of the values:
ΣX = 11.2 + 11.9 + 12.0 + 12.8 + 13.4 + 14.3 = 75.6
Step 2: Square each value and find the sum:
ΣX² = 11.2² + 11.9² + 12.0² + 12.8² + 13.4² + 14.3² = 958.94
Step 3: Substitute in the formula and solve:

\[ s^2 = \frac{\sum X^2 - \left[ (\sum X)^2 / n \right]}{n - 1} = \frac{958.94 - (75.6)^2 / 6}{5} = 1.28 \]

s = √1.28 = 1.13

Grouped Data: Variance and Standard Deviation

A Class        B Frequency (f)   C Midpoint (Xm)   D f · Xm   E f · Xm²
5.5-10.5       1                 8                 8          64
10.5-15.5      2                 13                26         338
15.5-20.5      3                 18                54         972
20.5-25.5      5                 23                115        2645
25.5-30.5      4                 28                112        3136
30.5-35.5      3                 33                99         3267
35.5-40.5      2                 38                76         2888
               n = 20                              Σ f · Xm = 490   Σ f · Xm² = 13,310

Step 1: Find the midpoint of each class and place it in column C.
Step 2: Multiply the frequency of each class by its midpoint and place the product in column D.
Step 3: Multiply the frequency of each class by the square of its midpoint and place the result in column E.
Step 4: Find the sums of columns B, D, and E.
Step 5: Substitute in the computational formula and solve for s² to get the variance:
\[ s^2 = \frac{\sum f \cdot X_m^2 - \left[ (\sum f \cdot X_m)^2 / n \right]}{n - 1} = \frac{13{,}310 - (490)^2 / 20}{20 - 1} = 68.7 \]

Then take the square root to obtain the standard deviation: s = √68.7 = 8.3

Coefficient of Variation

The coefficient of variation is a statistic that allows one to compare standard deviations when the units of the data sets are different. Symbol: CVar. The coefficient of variation is the standard deviation divided by the mean. The result is expressed as a percentage.

For samples:

\[ \text{CVar} = \frac{s}{\bar{X}} \cdot 100\% \]

For populations:

\[ \text{CVar} = \frac{\sigma}{\mu} \cdot 100\% \]

Example: In a randomly selected sample of people it is found that their height and weight have the following characteristics:
Height: mean = 68.34 in, s = 3.02 in
Weight: mean = 172.55 lb, s = 26.33 lb
How do these two variables compare?

Height: CVar = (3.02 / 68.34) · 100% = 4.42%
Weight: CVar = (26.33 / 172.55) · 100% = 15.26%

We see that the heights have considerably less variation than the weights. This makes intuitive sense, since we routinely see that weights vary much more between people than heights do.

Rough Estimate of Standard Deviation (range rule of thumb):

\[ s \approx \frac{\text{range}}{4} \]

Chebyshev's Theorem and Empirical Rule

Chebyshev's Theorem specifies the proportions of the spread of data in terms of the standard deviation.

Theorem: The proportion of values from a data set that will fall within k standard deviations of the mean will be at least 1 - 1/k², where k is a number greater than 1 (k is not necessarily an integer).
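Before moving on to the Chebyshev examples, the computational (shortcut) formulas and the coefficient of variation from this section can be checked with a small Python sketch. All function names are illustrative, and the data are the assumed readings of the examples (sample 11.2, 11.9, 12.0, 12.8, 13.4, 14.3; frequencies 1, 2, 3, 5, 4, 3, 2).

```python
import math


def sample_variance_shortcut(values):
    """Computational formula: (sum of X^2 - (sum of X)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def grouped_sample_variance(frequencies, midpoints):
    """Grouped version: uses sums of f * Xm and f * Xm^2, with n = sum of f."""
    n = sum(frequencies)
    sum_fx = sum(f * xm for f, xm in zip(frequencies, midpoints))
    sum_fx2 = sum(f * xm ** 2 for f, xm in zip(frequencies, midpoints))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100


s2 = sample_variance_shortcut([11.2, 11.9, 12.0, 12.8, 13.4, 14.3])
print(round(s2, 2), round(math.sqrt(s2), 2))       # 1.28 1.13

g2 = grouped_sample_variance([1, 2, 3, 5, 4, 3, 2], [8, 13, 18, 23, 28, 33, 38])
print(round(g2, 1), round(math.sqrt(g2), 1))       # 68.7 8.3

print(round(coefficient_of_variation(3.02, 68.34), 2))    # 4.42 (height)
print(round(coefficient_of_variation(26.33, 172.55), 2))  # 15.26 (weight)
```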
[Figure: Chebyshev's Theorem for k = 2 and k = 3. At least 75% of the values lie between X̄ - 2s and X̄ + 2s; at least 88.9% lie between X̄ - 3s and X̄ + 3s.]

Example 1: What percent of data values will fall within 2 standard deviations of the mean?
Calculate 1 - 1/k² for k = 2:

\[ 1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\% \]

Thus, if we have a distribution that has a mean of 70 and a standard deviation of 2, at least 75% of the data values will fall between 66 and 74 (70 - 2(2) = 66 and 70 + 2(2) = 74).

Example 2: A distribution has a mean of 50 and a standard deviation of 5. At least what percentage of the values will fall between 40 and 60?
Step 1: Subtract the mean from the largest value: 60 - 50 = 10
Step 2: Divide the difference by the standard deviation to get k: k = 10 / 5 = 2
Step 3: Use Chebyshev's Theorem to find the percentage: 1 - 1/k² = 1 - 1/2² = 1 - 1/4 = 0.75, or 75%

Example 3: A sample of the labor costs per hour to assemble a certain product has a mean of $2.60 and a standard deviation of $0.15. Using Chebyshev's theorem, find the range in which at least 88.89% of the data will lie.
Step 1: Calculate k:
1 - 1/k² = 0.8889
k² - 1 = 0.8889 k²
k² (1 - 0.8889) = 1
k² = 1 / 0.1111 = 9
k = 3
Step 2: 3 standard deviations are: 3 (0.15) = 0.45
Step 3: The range is: Low: 2.60 - 0.45 = 2.15, High: 2.60 + 0.45 = 3.05
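The three Chebyshev examples above follow the same pattern, which the Python sketch below tries to capture. The function names are illustrative assumptions; the second function simply solves 1 - 1/k² = fraction for k, as in Step 1 of the labor-cost example.

```python
import math


def chebyshev_min_fraction(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def chebyshev_k_for_fraction(fraction):
    """Solve 1 - 1/k^2 = fraction for k."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return math.sqrt(1 / (1 - fraction))


def interval_around_mean(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd


print(chebyshev_min_fraction(2))                    # 0.75
print(interval_around_mean(70, 2, 2))               # (66, 74)

k = (60 - 50) / 5                                   # mean 50, sd 5, upper value 60
print(chebyshev_min_fraction(k))                    # 0.75

k = round(chebyshev_k_for_fraction(0.8889))         # 3
low, high = interval_around_mean(2.60, 0.15, k)
print(round(low, 2), round(high, 2))                # 2.15 3.05
```

Because the theorem gives a lower bound for any distribution, the percentages returned here are guaranteed minimums rather than estimates.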
Empirical Rule for Normal Distributions

The following apply to a bell-shaped (normal) distribution:
Approximately 68% of the data values fall within one standard deviation of the mean.
Approximately 95% of the data values fall within two standard deviations of the mean.
Approximately 99.7% of the data values fall within three standard deviations of the mean.

[Figure: Empirical Rule for Normal Distributions, showing the intervals X̄ ± 1s, X̄ ± 2s, and X̄ ± 3s]

Summary of Measures of Variation

Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. This information is useful in comparing two or more data sets to determine which is more variable.

The measures of variance and standard deviation are used to determine the consistency of a variable. For example, in the manufacture of fittings, such as nuts and bolts, the variation in the diameters must be small, or the parts will not fit together.

The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution. For example, Chebyshev's theorem (explained above) shows that, for any distribution, at least 75% of the data values will fall within 2 standard deviations of the mean.

The variance and standard deviation are used quite often in inferential statistics.
M7 Chapter 3 Section 4

MEASURES OF POSITION

Measures of position are used to locate the relative position of a data value in the data set:
z score
Percentiles
Quartiles
Deciles

The z score, or standard score, represents the number of standard deviations that a data value lies above or below the mean:

\[ z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \]

For samples:

\[ z = \frac{X - \bar{X}}{s} \]

For populations:

\[ z = \frac{X - \mu}{\sigma} \]

Example 1: Which of these exam grades has a better relative position?
(a) A grade of 56 on a test with a mean of 48 and a standard deviation of 5.
(b) A grade of 45 on a test with a mean of 35 and a standard deviation of 10.

a) z = (56 - 48) / 5 = 8 / 5 = 1.6
b) z = (45 - 35) / 10 = 10 / 10 = 1

The grade in (a) has the better relative position, since its z score is higher.

Example 2: Human body temperatures have a mean of 98.20 and a standard deviation of 0.62. An emergency room patient is found to have a temperature of 101. Convert 101 to a z score.

z = (101 - 98.2) / 0.62 = 2.8 / 0.62 = 4.52

When all data for a variable are transformed into z scores, the resulting distribution will have a mean of 0 and a standard deviation of 1. A z score, then, is actually the number of standard deviations each value is from the mean for a specific distribution. See the example below:

X      z       z*s     Mean + z*s
4      -0.74   -3.7    4
7      -0.14   -0.7    7
1      -1.34   -6.7    1
7      -0.14   -0.7    7
15     1.46    7.3     15
3      -0.94   -4.7    3
11     0.66    3.3     11
17     1.86    9.3     17
7      -0.14   -0.7    7
9      0.26    1.3     9
4      -0.74   -3.7    4

Mean   7.7     0
s      5       1

(The first column of the summary rows gives the mean and standard deviation of X; the second gives the mean and standard deviation of the z scores.)
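A brief Python sketch of the z score calculations above follows. The function name `z_score` is illustrative, and the body temperature figures (mean 98.20, standard deviation 0.62) are the assumed reading of Example 2. Unlike the table, which uses the rounded values 7.7 and 5, the sketch standardizes with the unrounded sample mean and standard deviation.

```python
import statistics


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / sd


print(z_score(56, 48, 5))                    # 1.6
print(z_score(45, 35, 10))                   # 1.0
print(round(z_score(101, 98.20, 0.62), 2))   # 4.52

# Standardizing a whole data set gives z scores with mean 0 and sd 1.
values = [4, 7, 1, 7, 15, 3, 11, 17, 7, 9, 4]
x_bar = statistics.mean(values)
s = statistics.stdev(values)                 # sample standard deviation (n - 1)
z_values = [z_score(x, x_bar, s) for x in values]
print(round(x_bar, 1), round(s, 1))          # 7.7 5.0
```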
Percentiles

Percentiles divide the data set into 100 equal groups. Percentiles are symbolized by P_1, P_2, ..., P_99.

Percentile Formula (Percentile Rank)

The percentile corresponding to a given value X is computed by using the following formula:

\[ \text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\% \]

Example 1: Find the percentile rank for each test score in the data set: 5, 15, 12, 16, 20, 21
Step 1: Arrange the data in ascending order: 5, 12, 15, 16, 20, 21
Step 2: Use the formula (say, for 15). Two values fall below 15, so:

\[ P = \frac{2 + 0.5}{6} \cdot 100\% = \frac{2.5}{6} \cdot 100\% = \frac{250}{6}\% = 41.7 \]

so a score of 15 is at about the 42nd percentile.

Formula for finding a value corresponding to a given percentile P_m

P_m is the number that separates the bottom m% of the data from the top (100 - m)% of the data. This means that if your test score is at the 90th percentile, 90% of the people who took the test scored lower than you and only 10% scored higher than you.

Finding the location of P_m:

\[ c = \left( \frac{m}{100} \right) n \]

1. If c is not a whole number, round up to the next whole number. Starting at the lowest value, count over to the number that corresponds to the rounded-up value.
2. If c is a whole number, use the value halfway between the c-th and (c+1)-th values when counting up from the lowest value.
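Both percentile procedures above can be sketched in Python. The function names, and the use of m for the percentile, are assumptions made for illustration; the data set is the assumed reading of Example 1 (5, 15, 12, 16, 20, 21). The second function is restricted to 0 < m < 100 so that the (c+1)-th value always exists.

```python
import math


def percentile_rank(values, x):
    """((number of values below x) + 0.5) / (total number of values) * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100


def value_at_percentile(values, m):
    """Value P_m found by locating c = (m / 100) * n in the ordered data."""
    if not 0 < m < 100:
        raise ValueError("m must be strictly between 0 and 100")
    data = sorted(values)
    c = len(data) * m / 100
    if not c.is_integer():
        # Not a whole number: round up and take that position (1-based).
        return data[math.ceil(c) - 1]
    c = int(c)
    # Whole number: halfway between the c-th and (c+1)-th values (1-based).
    return (data[c - 1] + data[c]) / 2


scores = [5, 15, 12, 16, 20, 21]
print(round(percentile_rank(scores, 15), 1))   # 41.7
print(value_at_percentile(scores, 50))         # 15.5
print(value_at_percentile(scores, 25))         # 12
```

For m = 50 the location is c = 3, a whole number, so the result is halfway between the 3rd and 4th ordered values, which matches the median of the data.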
Quartiles

Quartiles divide the distribution into four groups, separated by Q1, Q2, and Q3. Q1 is the same as the 25th percentile; Q2 is the same as the 50th percentile; Q3 is the same as the 75th percentile.

Quartiles can also be found as follows:
1. Arrange the data in ascending order.
2. Find the median of the data values. This is the value for Q2.
3. Find the median of the data values that fall below Q2. This is the value for Q1.
4. Find the median of the data values that fall above Q2. This is the value for Q3.

Example: Find Q1, Q2, and Q3 for the data set: 15, 13, 6, 5, 12, 50, 22, 18
1. Arrange in ascending order: 5, 6, 12, 13, 15, 18, 22, 50
2. Find the median (Q2): Q2 = MD = (13 + 15) / 2 = 14
3. Find the median of the data values less than 14: 5, 6, 12, 13. Q1 = (6 + 12) / 2 = 9
4. Find the median of the data values greater than 14: 15, 18, 22, 50. Q3 = (18 + 22) / 2 = 20

IQR: Interquartile Range

The interquartile range is used as a rough estimate of variability in exploratory data analysis, and it is also used to identify outliers.
IQR = Q3 - Q1

An outlier is an extremely high or low data value when compared with the rest of the data values. An outlier can strongly affect the mean and standard deviation of a variable.

One method to check for outliers:
Step 1: Arrange the data in order and find Q1 and Q3.
Step 2: Find IQR = Q3 - Q1.
Step 3: Multiply IQR by 1.5.
Step 4: Subtract the value obtained in Step 3 from Q1 and add the value to Q3.
Step 5: Check the data set for any data value that is smaller than Q1 - 1.5 · IQR or larger than Q3 + 1.5 · IQR.

Another method, applicable to bell-shaped distributions, is to compare the data value in question with X̄ ± 3s (3 standard deviations above or below the mean). If the value is larger than X̄ + 3s or smaller than X̄ - 3s, it is considered an outlier.
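The median-of-halves procedure for quartiles and the 1.5 · IQR check for outliers can be sketched in Python as below. The function names are illustrative, and the data set is the assumed reading of the example (15, 13, 6, 5, 12, 50, 22, 18). The sketch follows the wording of the steps literally: values strictly below or above Q2 form the two halves.

```python
def median(sorted_values):
    """Median of data that is already in ascending order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Q1, Q2, Q3 using the median of the values below and above Q2."""
    data = sorted(values)
    q2 = median(data)
    q1 = median([v for v in data if v < q2])
    q3 = median([v for v in data if v > q2])
    return q1, q2, q3


def find_outliers(values):
    """Values smaller than Q1 - 1.5 * IQR or larger than Q3 + 1.5 * IQR."""
    q1, _, q3 = quartiles(values)
    step = 1.5 * (q3 - q1)
    low, high = q1 - step, q3 + step
    return [v for v in values if v < low or v > high]


data = [15, 13, 6, 5, 12, 50, 22, 18]
print(quartiles(data))       # (9.0, 14.0, 20.0)
print(find_outliers(data))   # [50]
```

Here IQR = 20 - 9 = 11, so the fences are 9 - 16.5 = -7.5 and 20 + 16.5 = 36.5, and only 50 falls outside them.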
M7 Chapter 3 Section 5

EXPLORATORY DATA ANALYSIS (EDA)

Traditional vs Exploratory

The purpose of traditional analysis is to confirm various conjectures about the data. The purpose of exploratory data analysis is to examine data in order to find out what information can be discovered. For example:
o Are there any gaps in the data?
o Can any patterns be discerned?

In EDA we use the following measures:
Organize data: Stem and leaf plot
Measure of central tendency: Median
Measure of variation: Interquartile range (Q3 - Q1)
Graphing: Boxplot

Boxplots are graphical representations of a five-number summary of a data set. The five specific values that make up a five-number summary are:
The lowest value of the data set (minimum)
Q1 (or 25th percentile)
The median (or 50th percentile)
Q3 (or 75th percentile)
The highest value of the data set (maximum)

Example: Construct a boxplot for the data: 33, 38, 43, 30, 29, 40, 51, 27, 42, 23, 31
1. Arrange the data in order: 23, 27, 29, 30, 31, 33, 38, 40, 42, 43, 51
2. Find the median: MD = 33
3. Find Q1 = 29
4. Find Q3 = 42
5. Draw a scale for the data on the x-axis.
6. Locate the five-number summary on the scale.
7. Draw a box around Q1 and Q3, draw a vertical line through the median, and connect the upper and lower values, as shown below:

[Figure: Boxplot on a scale from 20 to 55, with minimum 23, Q1 = 29, median 33, Q3 = 42, and maximum 51]
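The five-number summary behind the boxplot above can be computed with a minimal Python sketch. The function name `five_number_summary` is an illustrative choice, the quartiles use the same median-of-halves approach as the worked steps, and the data are the assumed reading of the example (33, 38, 43, 30, 29, 40, 51, 27, 42, 23, 31).

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    md = statistics.median(data)
    q1 = statistics.median([v for v in data if v < md])
    q3 = statistics.median([v for v in data if v > md])
    return data[0], q1, md, q3, data[-1]


data = [33, 38, 43, 30, 29, 40, 51, 27, 42, 23, 31]
minimum, q1, md, q3, maximum = five_number_summary(data)
print((minimum, q1, md, q3, maximum))  # (23, 29, 33, 42, 51)
print(q3 - q1)                         # 13, the interquartile range
```

These five values are exactly the points located on the scale in step 6; a plotting library could draw the box from them, but that is left out here to keep the sketch dependency-free.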
If the boxplots of two or more data sets are graphed on the same axis, the distributions can be compared.

Example: Consider the following two sets of data:

A      B
40     130
45     180
90     250
180    260
220    270
240    290
310    310
420    340

[Figure: Boxplots of A and B on a common scale from 0 to 500. A: minimum 40, Q1 = 67.5, median 200, Q3 = 275, maximum 420. B: minimum 130, Q1 = 215, median 265, Q3 = 300, maximum 340.]

        A       B
SDEV    133.2   68.6
Q1      67.5    215
Q3      275     300
MD      200     265

The variation or spread for the distribution of Data A is larger than that for Data B (observe also their standard deviations).

SUMMARY:

Traditional               Exploratory
Frequency distribution    Stem and leaf plot
Histogram                 Boxplot
Mean                      Median
Standard deviation        Interquartile range

Some basic ways to summarize data include measures of central tendency, measures of variation or dispersion, and measures of position. The three most commonly used measures of central tendency are the mean, median, and mode. The midrange is also used to represent an average.

The three most commonly used measurements of variation are the range, variance, and standard deviation. The most common measures of position are percentiles, quartiles, and deciles.

Data values are distributed according to Chebyshev's theorem and, in special cases, the empirical rule. The coefficient of variation is used to describe the standard deviation in relationship to the mean.

These methods are commonly called traditional statistics. Other methods, such as the stem and leaf plot, boxplot, and five-number summary, are part of exploratory data analysis; they are used to examine data to see what they reveal.