SticiGui Chapter 4: Measures of Location and Spread Philip Stark (2013)

Used by permission.

Summarizing data can help us understand them, especially when the number of data is large. This chapter presents several ways to summarize quantitative data by a typical value (a measure of location, such as the mean, median, or mode) and a measure of how well the typical value represents the list (a measure of spread, such as the range, inter-quartile range, or standard deviation). Markov's and Chebychev's inequalities show that these summary measures can contain a surprisingly large amount of information about the data.

Measures of Location

The farthest one can reduce a set of data, and still retain any information at all, is to summarize the data with a single value. Measures of location do just that: They try to capture with a single number what is typical of the data. What single number is most representative of an entire list of numbers? We cannot say without defining "representative" more precisely. We will study three common measures of location: the mean, the median, and the mode. The mean, median and mode are all "most representative," but for different, related notions of representativeness.

We saw the median in Chapter 3, Statistics. The median is the number that divides the (ordered) data in half: the smallest number that is at least as big as half the data. At least half the data are equal to or smaller than the median, and at least half the data are equal to or greater than the median.

The mode of a set of data (as opposed to the mode of a histogram) is the most common value among the data. It is rare that several data coincide exactly, unless the variable is discrete, or the measurements are reported with low precision.

The mean (more precisely, the arithmetic mean) is commonly called the average. It is the sum of the data, divided by the number of data:

mean = (sum of data)/(number of data) = total/(number of data).

For qualitative and categorical data, the mode makes sense, but the mean and median do not.
It is hard to see the connection between the mean, median, and mode from their definitions. However, the mean, the median, and the mode are "as close as possible" to all the data: For each of these three measures of location, the sum of the distances between each datum and the measure of location is as small as it can be. The differences among the three measures of location are in how "distance" is defined.

For the mean, the distance between two numbers is defined to be the square of their difference. That is, the sum of the squares of the differences between the data and the mean is smaller than the sum of squares of the differences between the data and any other number. (Equivalently, the RMS or root mean square of the differences from the mean is smaller than the RMS of the list of differences from any other number; the RMS is defined and discussed below.)

For the median, the distance between two numbers is defined to be the absolute value of their difference. That is, the sum of the absolute values of the differences between a median and the data is no larger than the sum of the absolute values of the differences between any other number and the data.

For the mode, the distance between two numbers is defined to be zero if the numbers are equal, and one if they are not equal. That is, the number of data that differ from a mode is no larger than the number of data that differ from any other value. Equivalently, a mode is a number from which the fewest possible data differ: a "most common" value.

All three of these measures of location are examples of statistics (with a lowercase "s"): numbers computed from data.

The mean, median, and mode can be related (approximately) to the histogram: loosely speaking, the mode is the highest bump, the median is where half the area is to the right and half is to the left, and the mean is where the histogram would balance, were it a solid object cut out of a uniform block of metal. (All these heuristics are approximate, and depend on the class intervals.)

Example 4-1: Calculating the mean, median and mode of a list.

For illustration, let's compute the mean, median and mode from the hypothetical data in Table 4-1.
Table 4-1: Random data to illustrate calculating measures of location.

data:   4   0   5   -2   -3   -5

Table 4-2: Sorted random data to illustrate calculating measures of location. This table shows these random data sorted into increasing order, which makes it easier to calculate the median.

data:   -5   -3   -2   0   4   5

Half the data are less than or equal to every number between -2 (inclusive) and 0 (exclusive). By our definition, the median is the smallest such number, namely, -2. For these hypothetical data, every value in the list is a mode: each value occurs exactly once, so all are "most common." Computing the mean is familiar:

(4 + 0 + 5 + (-2) + (-3) + (-5))/6 = -1/6, which is approximately -0.167.

In general, the mean and the median need not be close together. If the data have a symmetric distribution, the mean and median are exactly equal, but if the distribution of the data is skewed, the difference between the mean and the median can be large. This is because data in the tails of the distribution have a lot of leverage on the mean, just as a light person can balance a much heavier one on a teeter-totter if she sits much farther from the fulcrum than the heavier person does. The median is typically smaller than the mean if the data are skewed to the right, and typically larger than the mean if the data are skewed to the left.

Because the mean is (essentially) the balance point of the histogram, a small number of data can affect it a great deal, if they are very large (positive or negative). Corrupting just one datum can make the mean arbitrarily large or small. The median is affected much less by small subsets of the data. To make the median arbitrarily large or small, one must corrupt half the data. Corrupting just one datum changes the median by a limited amount, and not at all if one of the observations above the median is made larger, or one of the observations below the median is made smaller. Statistics that are not affected too much by small subsets of the data are resistant. The median is resistant; the mean is not.

Which measure of location is the most appropriate depends on the intended use of the summary.
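The calculations in Example 4-1 can be checked with a short Python sketch. It is only an illustration: the function names are ours, the median follows the chapter's definition (the smallest value at least as large as half the data), and the grid of competing values and the corrupted list are hypothetical choices made for the demonstration.

```python
from collections import Counter
from math import ceil


def mean(data):
    """Arithmetic mean: the sum of the data divided by the number of data."""
    return sum(data) / len(data)


def median(data):
    """Smallest value that is at least as large as half the data."""
    ordered = sorted(data)
    return ordered[ceil(len(ordered) / 2) - 1]


def modes(data):
    """All values that occur most often, in increasing order."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


def sum_squared(data, center):
    """Sum of squared differences between the data and a candidate center."""
    return sum((x - center) ** 2 for x in data)


def sum_absolute(data, center):
    """Sum of absolute differences between the data and a candidate center."""
    return sum(abs(x - center) for x in data)


data = [4, 0, 5, -2, -3, -5]   # Table 4-1
print(round(mean(data), 3))    # -0.167
print(median(data))            # -2
print(modes(data))             # every value ties: [-5, -3, -2, 0, 4, 5]

# Hypothetical grid of competing "typical values": -6.0 to 6.0 in steps of 0.1.
grid = [step / 10 for step in range(-60, 61)]

# The mean does at least as well as every grid value for squared distance,
# and the median does at least as well for absolute distance
# (the small tolerance absorbs floating-point rounding).
print(all(sum_squared(data, mean(data)) <= sum_squared(data, c) for c in grid))
print(all(sum_absolute(data, median(data)) <= sum_absolute(data, c) + 1e-9
          for c in grid))

# Resistance: corrupt one datum above the median (5 becomes 5000).
corrupted = [4, 0, 5000, -2, -3, -5]
print(round(mean(corrupted), 1), median(corrupted))   # 832.3 -2
```

Corrupting a single value moves the mean from about -0.167 to about 832, while the median stays at -2, which is the sense in which the median is resistant and the mean is not.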
If we are interested in a total, the mean tends to be the most relevant, because the mean is equal to the total divided by the number of data. For example, the mean income of the individuals in a family indicates how much the family can spend on each family member's necessities of life. On the other hand, the median can be much more informative in other situations.

Suppose we want to know how much money a family can afford to spend on housing. That depends on the total family income, which is the mean income of the family members, times the number of family members. For a family of five, consisting of two parents who work and three children with no income, the mean income, times five, is the total amount of money the family makes each year. The median income of these five family members is zero, because more than half of them make nothing.

On the other hand, suppose we want to decide whether a country is affluent. At issue, in some sense, is whether most of the citizens have a high income. The mean family income could be quite high even if most families earn essentially nothing, if income is highly concentrated in a few very wealthy families. Then the median family income would be a more meaningful measure: At least half the families make no more than the median, and at least half make at least as much as the median.

Similarly, suppose you are applying for a job as an architect at several large firms, and you want to get an idea of how much money you might expect to be earning in five years if you join a particular firm. Consider the salaries of architects in each firm five years after they are hired. Just one very high salary could make the mean salary high, so the mean might not reflect what is typical. On the other hand, half the architects make the median salary or less, and half make the median salary or more, so the median would give you a better idea of a typical salary.

Choosing a measure of location favorable to one's point of view is a common way to mislead people with statistics. For example, suppose you are the CEO of a company that makes gizmos and gadgets.
It might be in your interest to claim to your customers that you have lowered your prices, and to claim to your shareholders that you have raised your prices. Suppose that last year, you sold 100,000 gizmos at $10 each, and 1,000 gadgets at $1,000 each. This year, you sold 100,000 gizmos at $8 each, and 1,000 gadgets at $1,200 each (see Table 4-3).

Table 4-3: Quantities and Prices for Two Years of Gizmo and Gadget Sales.

Item     Quantity Each Year   Price Last Year   Price This Year
Gizmo    100,000              $10               $8
Gadget   1,000                $1,000            $1,200

The median price of the 101,000 items sold last year is $10, because more than half of the items sold were gizmos. The median price of the 101,000 items sold this year is $8. The mean price on the price list (without regard for the number of items sold) was $505 last year and $604 this year. The mean price of the 101,000 items sold last year is

(100,000 × $10 + 1,000 × $1,000)/101,000 = $19.80,

while this year it is

(100,000 × $8 + 1,000 × $1,200)/101,000 = $19.80.

The mean price per item sold is the same in both years: the total revenue was the same, and the number of items sold was the same. The moral is that one can make data appear to tell conflicting stories by choosing a measure of location disingenuously.

The following exercises check your ability to compute and to use the mean, median, and mode.

Exercise 4-1. Consider the following list:

data:   -10   -2   -2   5   -5   -2

1. What is the median of the list?
2. What is the mode of the list?
3. What is the mean of the list?

SOLUTIONS: 1. -2   2. -2   3. -2.667

Exercise 4-2. Homes in a certain area have a mean price of $400,000, but a median price of "only" $250,000. How might you explain this best?

a. A small percentage of very inexpensive homes makes the median small, but does not affect the mean much.
b. A small percentage of very expensive homes makes the mean large, but does not affect the median much.
c. There must be an error in the computation.
d. More than half of the home prices are less than $250,000.

SOLUTION: b

Exercise 4-3. TRUE or FALSE: Two countries have the same mean per capita personal income. The total personal income in the larger country is larger than the total personal income in the smaller country.

SOLUTION: True. The total personal income is the mean personal income times the number of people, so if the means are the same, the total is larger for the larger country.

Exercise 4-4. TRUE or FALSE: Two countries have the same median per capita personal income. The total personal income in the larger country is larger than the total personal income in the smaller country.

SOLUTION: False. The median could be larger or smaller than the mean, so the total could be larger or smaller than the median times the number of people, and we do not have enough information in this problem to tell which country has the larger total personal income. Typically, income distributions are skewed to the right, so the mean income is generally larger than the median income; however, even if that is true for both countries, it need not be larger by the same amount in both countries.

Exercise 4-5. Consider the following game. You pick a number (not necessarily an integer). I roll a fair die, and pay you $10, minus the square of the difference between your guess and the number the die lands on. We play over and over again. To win the most money in the long run, you should pick

a. 3
b. 3.5
c. 4
d. 5
e. It doesn't matter

SOLUTION: b. To make your winnings large, you want to make the average of the squared difference between your guess and the outcome of the roll small. The mean minimizes the average of the squared deviations, so you should pick the mean of the possible outcomes, 3.5.

Measures of Location Review

Measures of location summarize a list of numbers by a "typical" value. The three most common measures of location are the mean, the median, and the mode.

The mean is the sum of the values, divided by the number of values. It has the smallest possible sum of squared differences from members of the list.

The median is the middle value in the sorted list. It is the smallest number that is at least as big as at least half the values in the list. It has the smallest possible sum of absolute differences from members of the list.

The mode is the most frequent value in the list (or one of the most frequent values, if there is more than one). It differs from the fewest possible members of the list.

Spread or Variability

Measures of location summarize what is typical of elements of a list, but not every element is typical. Are all the elements close to each other? Are most of the elements close to each other? What is the biggest difference between elements? On the average, how far are the elements from each other? Measures of spread or variability tell us.

The Importance of Variability

Consider three mechanical golfers (this example is from Hooke, 1983). In golf, the object is to get a low score: to take fewer strokes to complete the course. Suppose the golfers play as shown in Table 4-4.

Table 4-4: Performance of mechanical golfers.

Golfer   Score 1   Frequency 1   Score 2   Frequency 2   Average score
1        72        100%                                  72
2        69        25%           73        75%           72
3        70        50%           74        50%           72

The golfers' average scores are equal: nominally, they are equally skilled. However, consider what happens when they play each other. Golfer 1 beats golfer 2 when golfer 2 scores 73, which happens 75% of the time. Golfer 2 beats golfer 3 when golfer 3 scores 74, and when golfer 3 scores 70 and golfer 2 scores 69. The first occurs half the time, and, assuming that the players' scores are independent (we'll get to that notion in Chapter 17, Probability: Axioms and Fundaments), the second occurs 50% × 25% = 12.5% of the time, so golfer 2 beats golfer 3 62.5% of the time. Finally, golfer 3 beats golfer 1 when golfer 3 scores 70, namely, 50% of the time (they play evenly). Their average scores are equal, but 1 beats 2 more often than not, 2 beats 3 more often than not, and 3 plays 1 even. This shows that there is more going on than the average scores indicate: variability matters too.

Here is another example of the importance of variability. The average number of children under 18 per family in the US was 0.89 according to the 1990 census, so the average family size is about 2.9 people (is this logic sound? what is a family?). If you were in the construction business, that might suggest to you that a two-bedroom home is the right size to build for the average American family (two parents sharing a room, and another room for the 0.89 children). However, family sizes vary over quite a large range; indeed, the same report shows that the average number of children for families that have children is 1.86, so families that have children would tend to need a three-bedroom home, rather than a two-bedroom home, if the children are to have their own rooms.

Much information is lost in reducing a list of numbers to a single summary number, such as the mean or median. Measures of location alone are not very informative.
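The head-to-head percentages for the mechanical golfers can be reproduced by enumerating every pair of scores. The sketch below is illustrative only: it takes the score frequencies from Table 4-4, assumes (as the text does) that the golfers' scores are independent, and uses helper names of our own choosing.

```python
from fractions import Fraction
from itertools import product

# Score distributions from Table 4-4: {score: frequency}.
golfers = {
    1: {72: Fraction(1)},
    2: {69: Fraction(1, 4), 73: Fraction(3, 4)},
    3: {70: Fraction(1, 2), 74: Fraction(1, 2)},
}


def average_score(dist):
    """Long-run average score: each score weighted by its frequency."""
    return sum(score * freq for score, freq in dist.items())


def chance_beats(a, b):
    """Chance that golfer a scores strictly lower than golfer b,
    assuming the two golfers' scores are independent."""
    return sum(
        freq_a * freq_b
        for (score_a, freq_a), (score_b, freq_b)
        in product(golfers[a].items(), golfers[b].items())
        if score_a < score_b
    )


for g, dist in golfers.items():
    print(g, average_score(dist))        # each average is 72

print(float(chance_beats(1, 2)))         # 0.75
print(float(chance_beats(2, 3)))         # 0.625
print(float(chance_beats(3, 1)))         # 0.5
```

Exact fractions are used so that the results match the percentages in the text without rounding error.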
For Java applet Figures 4-1, 4-2 and 4-3 (histograms of different data sets with means equal to zero), please visit: http://www.stat.berkeley.edu/~stark/sticigui/text/location.htm

We need more than just the mean or median to tell these distributions apart. In Figure 4-1 the data cluster both in the middle and at the ends. In Figure 4-2 the data are more concentrated near the middle: there is much less spread than in the first. Figure 4-3 is extremely concentrated: The data are much closer to each other than in the other two examples. Measures of spread or variability summarize with a single number whether the observations tend to cluster near the center of the distribution, or how spread out they are. If the spread is small, most of the data are nearly equal; if the spread is large, there are large differences among the data.

The Range, IQR and SD

The three most common measures of spread or variability are the range, the interquartile range (IQR), and the standard deviation (SD).

The range of a list is the largest value minus the smallest value. It is the width of the smallest interval that contains all the data, so it measures spread. It is not resistant, because changing just one datum can make it arbitrarily large.

The IQR is the upper quartile (75th percentile), minus the lower quartile (25th percentile). It is the width of the interval that contains the middle 50% of the data and thus is a measure of spread. It is insensitive to the most extreme values of the data (assuming that there are more than four data). The IQR is resistant: changing just one datum has a limited effect on it.

Note that neither the range nor the IQR is a range of numbers, despite their names: each is a single number.

The RMS (root mean square) of a list measures the average size of its entries. It is defined as follows:

RMS = square-root( (sum of the squares of the entries)/(number of entries) ) = [ (sum of the squares of the entries)/(number of entries) ]^(1/2).

(Recall that a number raised to the one-half power is the square-root of the number; this is the notation we shall use from now on.) In computing the RMS, we divide by the number of entries before taking the square-root.

What difference does it make to square the entries? Squaring them makes every term in the sum nonnegative, so positive and negative entries do not cancel. If we ignored the square and the square-root, we would just have the mean of the list, which could be zero, even if all the numbers were large in magnitude, because positive and negative entries could cancel. Squaring the entries before averaging them prevents cancellations.
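To see the cancellation concretely, here is a minimal Python sketch of the RMS as defined above. The function name and the four-entry list are hypothetical, chosen so that the mean is zero while the RMS is a round number.

```python
from math import sqrt


def rms(entries):
    """Root mean square: square the entries, average them, then take the square-root."""
    return sqrt(sum(e * e for e in entries) / len(entries))


entries = [1, -1, 7, -7]                 # hypothetical list, not from the text

print(sum(entries) / len(entries))       # 0.0: positive and negative entries cancel
print(rms(entries))                      # 5.0: squaring first prevents cancellation
```

The mean says nothing about how large the entries are, while the RMS of 5 reflects that the entries have magnitudes of 1 and 7.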

The RMS is not the only measure of the average size of the elements of a list; for example, the average absolute value of the terms is another measure of the typical size of elements in a list. The RMS is used more often. Example 4-2 illustrates calculating the RMS of a list.

Example 4-2: Calculating the RMS of a list.

data:   2   0   -2   4   -4

The average of this list is:

(2 + 0 + (-2) + 4 + (-4))/5 = 0/5 = 0.

Nonetheless, the typical "size" of elements of the list is about 2.8. The RMS of the list is

((2^2 + 0^2 + (-2)^2 + 4^2 + (-4)^2)/5)^(1/2) = ((4 + 0 + 4 + 16 + 16)/5)^(1/2) = (40/5)^(1/2),

which is approximately 2.8.

Example 4-2 makes it clear that the mean of the squares of the elements of a list is not generally equal to the square of the mean of the elements of the list: the square of the mean is 0, but the mean of the squares is not. The RMS of a list is zero if and only if all the entries in the list are zero.

The standard deviation (SD) of a list is the "typical size" of the difference between elements of the list and the mean of the list, measured by the RMS. The SD measures how spread out the data are around their mean. To find the SD, we first find the mean of the list, then make a list of deviations from the mean:

deviation of value = value - mean of list,

and finally, find the RMS of the list of deviations from the mean (the square-root of the average of the squares of the deviations). In the example just given, the mean is zero, so the SD is equal to the RMS.

Example 4-3: Calculating the SD of a list.

data:   4   6   1   3   5   2

The mean of the list is

(4 + 6 + 1 + 3 + 5 + 2)/6 = 3.5.

The list of deviations from the mean is

{(4-3.5), (6-3.5), (1-3.5), (3-3.5), (5-3.5), (2-3.5)} = {0.5, 2.5, -2.5, -0.5, 1.5, -1.5}.

The SD is the RMS of this list of deviations from the mean:

SD = ((0.5^2 + 2.5^2 + (-2.5)^2 + (-0.5)^2 + 1.5^2 + (-1.5)^2)/6)^(1/2) = (17.5/6)^(1/2) = 1.71.

The units of the SD are the same as the original units of measurement. For example, if the list consists of measurements of heights in inches, the SD has units of inches.

Recall that the RMS of a list is zero if and only if all the elements in the list are zero. Thus the SD of a list is zero if and only if all the deviations from the mean are zero, that is, if and only if all the elements are equal to each other (and hence equal to their mean). Similarly, the range of a list is zero if and only if all the elements are equal. In contrast, the IQR of a list can be zero even if not all the elements are the same: only the middle 50% of the observations need to be equal for the IQR to be zero.

Some calculators have a button labeled s, which computes something related to the SD as we have defined it. In the usual definition of s, the sum of squares of residuals from the mean is divided by (number of data - 1) rather than by (number of data) before taking the square-root. This is called the sample standard deviation. When the number of data is large, there is not much difference between the standard deviation and the sample standard deviation, but when the number of data is small, the difference can be big.

The following exercises check that you can calculate measures of spread, and that you understand what they mean.

Exercise 4-6. Refer to Table 3-4, sorted gravity data, in Chapter 3 (reproduced below).

Table 3-4: Sorted gravity data.

-152  -132  -132  -128  -122  -121  -120  -113  -112  -108
-107  -107  -106  -106  -106  -105  -101  -101   -99   -89
 -87   -86   -83   -83   -80   -80   -79   -74   -74   -74
 -71   -71   -69   -67   -67   -65   -62   -61   -60   -60
 -59   -55   -54   -54   -52   -50   -49   -48   -48   -47
 -44   -43   -38   -37   -35   -34   -34   -29   -27   -27
 -26   -24   -24   -19   -19   -19   -19   -18   -16   -16
 -16   -15   -14   -14   -12   -12   -12    -4    -1     0
   0     1     2     7    14    14    14    14    18    18
  19    24    29    29    41    45    51    72   150   155

1. The range of the gravity data (the tabulated numbers, which are 10^8 times the deviations from the reference value) is?
2. The IQR of the gravity data is?

SOLUTIONS:
1. The smallest datum is -152 and the largest is 155, so the range is 155 - (-152) = 307.
2. The lower quartile of the gravity data is -80 and the upper quartile is -12, so the interquartile range is -12 - (-80) = 68.

Exercise 4-7. TRUE or FALSE: Two students have taken all the same courses, and have the same grade point average (GPA), 3.5. Their grades might not have been the same in each class, but overall they must have the same number of A grades as each other.

SOLUTION: False. Two lists can have the same mean without having the same entries. For example, the grade lists {3, 3, 4, 4} and {2, 4, 4, 4} both correspond to a GPA of 3.5.

Exercise 4-8. Here is a table of fabricated data.

data:   -8   7   -10   10   3

1. What is the mean of the data?
2. What is the RMS of the data?
3. What is the SD of the data?

SOLUTIONS: 1. 0.39 to 0.41   2. 8.015 to 8.035   3. 8.005 to 8.025

Measures of Spread Review

Measures of spread summarize how much members of a list of numbers differ from each other. The three most common measures of spread are the range, the inter-quartile range, and the standard deviation.

The range is the largest element of the list, minus the smallest element of the list: the maximum difference between elements of the list. It is sensitive only to the most extreme values in the list. The range of a list is zero if and only if all the elements of the list are equal.

The inter-quartile range (IQR) is the upper quartile of the list (75th percentile) minus the lower quartile of the list (25th percentile). It measures the width of the interval that contains the middle 50% of the data. It is not sensitive to the extreme values of the list. The IQR of a list is zero if (at least) the middle 50% of the values are equal.

The standard deviation (SD) is the average distance from the data to their mean (the RMS of the deviations of the data from their mean). It depends on the values of all the data. The SD of a list is zero if and only if all the elements in the list are equal (to each other, and hence to their mean).
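The three measures of spread, and the answers to Example 4-3 and Exercises 4-6 and 4-8, can be checked with a short Python sketch. It is a sketch under stated assumptions, not a reference implementation: the function names are ours, and the percentile rule (the smallest value at least as large as p percent of the data) is assumed to be the Chapter 3 definition, by analogy with the definition of the median used in this chapter.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def rms(values):
    """Root mean square of a list."""
    return sqrt(sum(v * v for v in values) / len(values))


def sd(values):
    """Standard deviation: the RMS of the deviations from the mean (divides by n)."""
    m = mean(values)
    return rms([v - m for v in values])


def sample_sd(values):
    """Sample standard deviation: divides by (n - 1) instead of n."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def value_range(values):
    return max(values) - min(values)


def percentile(values, p):
    """Assumed definition: smallest value at least as large as p percent of the data.
    p is an integer between 1 and 100."""
    ordered = sorted(values)
    k = -(-p * len(ordered) // 100)      # ceiling of p*n/100 in integer arithmetic
    return ordered[max(k, 1) - 1]


def iqr(values):
    return percentile(values, 75) - percentile(values, 25)


example_4_3 = [4, 6, 1, 3, 5, 2]
print(round(sd(example_4_3), 2))          # 1.71
print(round(sample_sd(example_4_3), 2))   # 1.87: noticeably larger for so few data

exercise_4_8 = [-8, 7, -10, 10, 3]
print(round(mean(exercise_4_8), 3), round(rms(exercise_4_8), 3),
      round(sd(exercise_4_8), 3))         # 0.4 8.025 8.015

gravity = [                               # Table 3-4, sorted gravity data
    -152, -132, -132, -128, -122, -121, -120, -113, -112, -108,
    -107, -107, -106, -106, -106, -105, -101, -101, -99, -89,
    -87, -86, -83, -83, -80, -80, -79, -74, -74, -74,
    -71, -71, -69, -67, -67, -65, -62, -61, -60, -60,
    -59, -55, -54, -54, -52, -50, -49, -48, -48, -47,
    -44, -43, -38, -37, -35, -34, -34, -29, -27, -27,
    -26, -24, -24, -19, -19, -19, -19, -18, -16, -16,
    -16, -15, -14, -14, -12, -12, -12, -4, -1, 0,
    0, 1, 2, 7, 14, 14, 14, 14, 18, 18,
    19, 24, 29, 29, 41, 45, 51, 72, 150, 155,
]
print(value_range(gravity), iqr(gravity))  # 307 68
```

For the six data of Example 4-3 the SD and the sample SD differ by about 0.16, which illustrates why the distinction matters for short lists.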

Affine Transformations

Some variables have simple relationships to other variables, for example, measurements of elevation above sea level in feet, and measurements of elevation above sea level in meters: Each elevation in meters above sea level is 0.3048 times the corresponding elevation in feet above sea level. When the relationship between variables is simple, so is the relationship between their measures of location and spread. An affine transformation or change of variables is particularly simple. Affine transformations have the equation of a line:

(transformed value of x) = a × (original value of x) + b,

where a and b are constants. (Some books call this a linear transformation, because it has the equation of a straight line.) For example, height in inches is related to height in feet by an affine transformation, with a = 12 and b = 0:

(height in inches) = 12 × (height in feet) + 0.

Similarly, temperature in degrees Fahrenheit is related to temperature in degrees Centigrade by an affine transformation with a = 9/5 and b = 32:

(temp in F) = 9/5 × (temp in C) + 32.

Currencies are related to each other by affine transformations as well, with a = (exchange rate) and b = 0. The measures of location and spread introduced in this chapter behave quite regularly when a list is transformed by an affine transformation.

How Measures of Location and Spread behave under Affine Transformations

If a list is transformed so that

(transformed value) = a × (original value) + b,

then

(Mode of transformed list) = a × (Mode of original list) + b
(Median of transformed list) = a × (Median of original list) + b, if a is positive
(Mean of transformed list) = a × (Mean of original list) + b
(Range of transformed list) = |a| × (Range of original list)
(SD of transformed list) = |a| × (SD of original list)
(IQR of transformed list) = |a| × (IQR of original list), if a is positive.

The additive constant b shifts every value by the same amount, so it changes the measures of location but not the measures of spread. The median of the transformed list can differ slightly from a × (median of original list) + b when a is negative; similarly, the IQR of the transformed list can differ slightly from |a| × (IQR of original list) if a is negative, because of the definition of percentiles applied to a list with its signs reversed. Some of these relations are derived in a footnote.

Using these relations can simplify calculating measures of location or spread when the units of measurement are changed. The following exercise checks your ability to use these rules.

Exercise 4-9.

1. The mean of a list is 6. Consider multiplying each element of the list by 8 then adding 10 to get a new list. What is the mean of the new list?
2. The mode of a particular list is unique and equal to 7. Consider multiplying each element of the list by 8 then adding 10 to get a new list. What is the mode of the new list?
3. The median of a particular list is 16. Consider multiplying each element of the list by 8 then adding 10 to get a new list. What is the median of the new list?
4. The SD of a list is 22. Consider multiplying each element of the list by 8 then adding 10 to get a new list. What is the SD of the new list?
5. The IQR of a list is 24. Consider multiplying each element of the list by 8 then adding 10 to get a new list. What is the IQR of the new list?
6. The range of a list is 26. Consider multiplying each element of the list by 8 then adding 10 to get a new list. What is the range of the new list?

SOLUTIONS: 1. 58   2. 66   3. 138   4. 176   5. 192   6. 208

Markov's Inequality and Chebychev's Inequality

Measures of location and spread can tell us a great deal about lists of numbers. For example, for any list, at least half the numbers in the list are no larger than the median, and at least half the numbers in the list are at least as large as the median (this is one way of defining the median). The mean and SD also can tell us about the fractions of values in a list in various ranges.

Suppose that a list of numbers contains no negative number, and that 10% of the values in the list are greater than or equal to 50. What is the smallest the mean of the list could be? The mean would be smallest if all the values in the list were as small as they could be, subject to the constraints that the values are not negative, and 10% equal or exceed 50. If 90% of the values were equal to zero, and the rest were equal to 50, that would give the smallest mean:

0 × 0.9 + 50 × 0.1 = 5.

That is, if a list contains no negative number, and 10% of the numbers in the list are 50 or larger, then the mean of the list must be at least 5. More generally, if any particular fraction of values in a list exceeds a given threshold, and none of the values in the list is negative, then the mean of the list cannot be arbitrarily small. Markov's inequality turns this idea upside down to limit the fraction of numbers in a list that can exceed any given threshold, provided the list contains no negative number. The limit depends on the mean of the list, and the threshold.

Markov's Inequality (for lists)

If the mean of a list of numbers is M, and the list contains no negative number, then for every positive number x:

[fraction of numbers in the list that are greater than or equal to x] ≤ M/x.

Note 4-7: Heuristic derivation of Markov's inequality.

The basic idea is that of a see-saw or teeter-totter: to balance a large weight, the other weight should be as far as possible from the fulcrum. The constraint that none of the elements in the list is below zero limits how far to the left we can put the balancing weights.

Suppose the average of a list of nonnegative numbers is M, and that a fraction w of the elements of the list are at or above some value x. We want to make that fraction as large as possible, while keeping the average equal to M. To make the fraction w as large as possible, we should

(a) put the rest of the list at zero, and
(b) put the entire fraction w at the point x (and not partly above x).

The average is then

M = w × x + (100% - w) × 0 = w × x.

Solving for w, we get w = M/x as the largest possible fraction of numbers in the list that can be equal to or greater than x. This is Markov's inequality.

Example 4-4: Applying Markov's inequality.

There are 200 students in a class. The average amount of money in their pockets is $15. How many could have $75 or more in their pockets?

SOLUTION: No student can have a negative amount of money in his or her pocket, so Markov's inequality applies. Markov's inequality guarantees that

[fraction of students with at least $75 in their pockets] ≤ $15/$75 = 0.2 = 20%.

Thus at most 20% of the students (40 students) could have $75 or more in their pockets.

If we know the mean of a list and its SD, we know something about how many of the numbers in the list must be in various ranges. Suppose that 25% of the numbers in a list differ from the mean by 30 or more. How small could the SD of the list be? To make the SD smallest, all the numbers should be as close as possible to the mean, subject to the constraint that at least 25% of them differ from the mean by 30 or more. This is achieved by making 75% of the numbers equal to the mean, 12.5% equal to the mean minus 30, and 12.5% equal to the mean plus 30. Thus the SD of the list must be at least

(0.125 × 30^2 + 0.75 × 0^2 + 0.125 × 30^2)^(1/2) = 15.

More generally, if a particular fraction of the values differ from the mean of the list by at least a given threshold, then the SD of the list cannot be too small. Chebychev's inequality turns this around to find a bound on the fraction of numbers in the list that differ from the mean by more than any given threshold. The bound depends on the SD of the list and the threshold.
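The two extremal lists described above, and the count in Example 4-4, can be verified numerically. The sketch below is illustrative only: the function names, the list lengths, and the mean of 100 used for the second list are hypothetical choices; only the proportions come from the text.

```python
from math import floor, sqrt


def markov_bound(mean, threshold):
    """Markov's upper bound on the fraction of a nonnegative list
    that is greater than or equal to a positive threshold."""
    if mean < 0 or threshold <= 0:
        raise ValueError("need a nonnegative mean and a positive threshold")
    return min(1.0, mean / threshold)


def sd(values):
    """Standard deviation: RMS of the deviations from the mean."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


# Smallest possible mean when 10% of a nonnegative list is at least 50:
# 90% of the entries at zero and 10% at exactly 50.
extremal_mean_list = [0] * 9 + [50]
print(sum(extremal_mean_list) / len(extremal_mean_list))   # 5.0

# Example 4-4: 200 students, mean $15, threshold $75.
print(markov_bound(15, 75))                                # 0.2
print(floor(200 * 15 / 75))                                # 40 students at most

# Smallest possible SD when 25% of a list is 30 or more from the mean:
# 75% at the mean, 12.5% at mean - 30, 12.5% at mean + 30.
# The mean of 100 is a hypothetical value; any mean gives the same SD.
extremal_sd_list = [70] + [100] * 6 + [130]
print(sd(extremal_sd_list))                                # 15.0
```

In both cases the extremal list attains the bound exactly, which is why neither bound can be improved without more information about the list.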

Chebychev's inequality (for lists)

If the mean of a list of numbers is M and the standard deviation of the list is SD, then for every positive number k,

[the fraction of numbers in the list that are k×SD or further from M] ≤ 1/k^2.

Note 4-8: Heuristic derivation of Chebychev's inequality.

Chebychev's inequality can be derived from Markov's inequality, by considering the list of squared deviations from the mean. The list of squared deviations from the mean cannot have negative entries, so

[fraction of squared deviations that are x or larger] ≤ (average of squared deviations from the mean)/x.

The fraction of squared deviations from the mean that are x or larger is the same as the fraction of data that are x^(1/2) or more from the mean. Now substitute x^(1/2) = k×SD. That gives

[fraction of data that are k×SD or further from the mean] ≤ (average of squared deviations from the mean)/(k×SD)^2.

Recall that the SD is the square-root of the average of the squared deviations from the mean, so the numerator on the right hand side is SD^2. Substituting into the numerator, and canceling the factor of SD^2 in the numerator with that in the denominator, gives

[fraction of data that are k×SD or further from the mean] ≤ 1/k^2,

which is Chebychev's inequality.

Chebychev's inequality says that not too many of the numbers in a list can be far from the mean, where far is measured in standard deviations. Conversely, if a large fraction of the values are far from the mean, the SD of the list must be large. Table 4-5 lists some specific bounds implied by Chebychev's inequality:

Table 4-5: Bounds implied by Chebychev's inequality.

Number of standard deviations    Largest possible fraction of values this far or further from the mean
1                                100%
2                                25%
3                                11.11%
4                                6.25%
5                                4%
6                                2.78%

Example 4-5 illustrates applying Chebychev's inequality to find bounds on the fraction of weights in a given range from the mean and SD of a list of weights.

Example 4-5: Applying Chebychev's inequality.

The mean weight of students in a certain class is 140 lbs, and the SD of their weights is 30 lbs. What fraction weighs between 90 lbs and 190 lbs?

SOLUTION: We cannot get an exact answer, but we can get a lower bound using Chebychev's inequality. The range from 90 lbs to 190 lbs is the mean, plus or minus 50 lbs. 50 lbs is 1 2/3 times the SD of the weights, so according to Chebychev's inequality, the fraction of students who weigh less than 90 lbs or more than 190 lbs is at most

\[ \frac{1}{(1\tfrac{2}{3})^2} = \frac{1}{(1.6667)^2} = 0.36 = 36\%. \]

Thus the fraction who weigh between 90 lbs and 190 lbs is at least 100% - 36% = 64%.

In some problems, it is possible to apply both Markov's inequality and Chebychev's inequality. When that happens, use whichever inequality gives the more precise answer, that is, the inequality that limits the fraction most stringently. Example 4-6 illustrates this idea.

Example 4-6: Sometimes Markov's inequality and Chebychev's inequality both apply.

On the average, it takes 45 minutes to cross the San Francisco Bay Bridge during rush hour. The SD of the time it takes to cross the bridge is 15 minutes. What's the largest fraction of the time it could take more than 2 hours to cross the bridge?

SOLUTION: Travel time is positive, so we can use Markov's inequality. By Markov's inequality,

\[ [\text{fraction of the time it takes more than 2 hours}] \le \frac{45 \text{ minutes}}{2 \text{ hours}} = \frac{45 \text{ minutes}}{120 \text{ minutes}} = 0.375 = 37.5\%. \]

On the other hand, we can also apply Chebychev's inequality, as follows.

\[ 2 \text{ hours} = 120 \text{ minutes} = 45 \text{ minutes} + 75 \text{ minutes} = \text{mean time} + 75 \text{ minutes} = \text{mean time} + 5\,\mathrm{SD}. \]

That is, two hours is 5 SD above the mean. On the other hand, 5 SD below the mean is

\[ 45 \text{ minutes} - 5 \times (15 \text{ minutes}) = -30 \text{ minutes}. \]

This is not a possible travel time (it always takes a positive amount of time to cross the bridge). Thus the fraction of the time it takes more than 2 hours or less than -30 minutes to cross the bridge must equal the fraction of the time it takes more than 2 hours to cross the bridge. By Chebychev's inequality,

\[ [\text{fraction of the time it takes less than } -30 \text{ minutes or more than 2 hours}] \le \frac{1}{5^2} = \frac{1}{25} = 4\%. \]

Because the fraction of the time it takes more than 2 hours or less than -30 minutes to cross the bridge is the same as the fraction of the time it takes more than 2 hours, we have

\[ [\text{fraction of the time it takes more than 2 hours}] \le 4\%. \]

This is a more restrictive bound than the one Markov's inequality gives in this problem (Markov's inequality gave 37.5%), so we should use it instead. (Larger lower bounds are better; smaller upper bounds are better.)
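The comparison in Example 4-6 can be sketched in Python as follows. The helper names (`markov_bound`, `chebychev_bound`, `best_upper_tail_bound`) are hypothetical, and the sketch assumes a list with no negative entries, a positive SD, and a threshold above the mean.

```python
def markov_bound(mean, threshold):
    """Upper bound on the fraction of a nonnegative list that is >= threshold."""
    if mean < 0 or threshold <= 0:
        raise ValueError("need mean >= 0 and threshold > 0")
    return min(1.0, mean / threshold)


def chebychev_bound(k):
    """Upper bound on the fraction of a list that is k SDs or further from the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / k**2)


def best_upper_tail_bound(mean, sd, threshold):
    """Smaller of the two upper bounds on the fraction of values >= threshold.

    Assumes no negative values (needed for Markov's inequality),
    sd > 0, and threshold > mean (so that k is positive).
    """
    k = (threshold - mean) / sd  # threshold expressed in SDs above the mean
    return min(markov_bound(mean, threshold), chebychev_bound(k))


# Example 4-6, with all times in minutes.
markov = markov_bound(45, 120)              # 45/120 = 0.375
cheb = chebychev_bound((120 - 45) / 15)     # 1/5**2 = 0.04
best = best_upper_tail_bound(45, 15, 120)   # the smaller of the two
```

Taking the minimum is safe because both numbers are valid upper bounds on the same fraction: the values above the threshold are a subset of the values that are k SDs or further from the mean in either direction.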

The following exercises check your ability to apply Markov's inequality and Chebychev's inequality.

Exercise 4-10. According to Chebychev's inequality, at least what decimal fraction of a list of numbers must be within 5.6 SD of the mean?

SOLUTION: By Chebychev's inequality for lists, the fraction of observations beyond 5.6 SDs of the mean is at most \( 1/(5.6)^2 \approx 0.032 \), so the fraction within 5.6 SDs of the mean is at least 1 - 0.032 = 0.968.

Exercise 4-11. A student has a GPA (grade point average) of 3.5. In each course she takes, she gets a grade between 0 (failing) and 4.0 (A+). What is the largest decimal fraction of her grades that could be 4 or higher?

SOLUTION: By Markov's inequality, the fraction of grades greater than or equal to 4 is at most 3.5/4 = 0.875, or about 0.88.

Exercise 4-12. A certain type of light bulb has an average lifetime of 10,000 hours. The SD of bulb lifetimes is 490 hours. What decimal fraction of bulbs could last more than 12,303 hours?

SOLUTION: 0.0443 to 0.0463. (12,303 hours is 2,303/490 = 4.7 SDs above the mean, so Chebychev's inequality gives at most \( 1/(4.7)^2 \approx 0.0453 \).)

Summary

This chapter introduced several ways to summarize lists of numbers (quantitative data). Some summaries, measures of location, seek to be as close as possible to every element of the list, to typify the elements. The mean, median, and mode are examples: They represent typical values of the list. The mean, median, and mode each are "as close as possible" to all the elements in the list, for different definitions of the proximity of two numbers: for the mean, the distance between two numbers is the square of their difference; for the median, the distance between two numbers is the absolute value of their difference; and for the mode, the distance between two numbers is 1 if the numbers differ, 0 if they are equal.

The mean is the sum of the elements, divided by the number of elements. The median is the smallest element that is at least as large as at least half the elements. The mode is the most common value in the list. The mode makes sense for qualitative and categorical data as well as quantitative data, but the mean and median make sense only for quantitative data.
The mean, median, and mode differ in their sensitivity to changes to the data, or resistance. A statistic that can be changed arbitrarily by altering a single datum is not resistant. The median is resistant. The mean is not resistant. The resistance of the mode depends on the distribution of values in the list. The RMS (root mean square) measures the average size of the elements of a list, without regard to their signs. The RMS is not resistant.

Other summaries, measures of spread, reflect how the values of the list differ from each other. Examples include the range, the SD (standard deviation), and the IQR (inter-quartile range). The range of a list of numbers is the largest number minus the smallest number. The range is zero if and only if all the numbers in the list are equal. The range is not resistant. The SD measures the average size of the differences between the mean and the elements of the list: It is the RMS of the list of deviations from the mean. The SD of a list is zero if and only if all the numbers in the list are equal. The SD is not resistant. The IQR is the upper quartile minus the lower quartile. It is the width of an interval that contains the middle half of the data: 25% below the median and 25% above the median. The IQR can be zero even if not all the numbers are equal, but the middle 50% must be equal. The IQR is resistant.

If the units of measurement change by an affine transformation, measures of location and spread in the new units of measurement have simple relationships to their values in the old units.

Measures of location and spread contain a surprising amount of information about lists of numbers: Markov's inequality limits the fraction of elements of the list that exceed any given threshold, in terms of the mean of the list and the threshold, provided the list contains no negative number. Chebychev's inequality limits the fraction of elements whose difference from the mean of the list exceeds any given threshold, in terms of the SD of the list and the threshold.
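As a rough illustration of the two inequalities recalled above, the following Python sketch checks them on one small list. The list `data`, the thresholds, and the function names are all made up for illustration; the check confirms the bounds for this list only and proves nothing in general.

```python
from math import sqrt


def fraction_at_least(values, threshold):
    """Fraction of the list that is equal to or greater than threshold."""
    return sum(v >= threshold for v in values) / len(values)


def fraction_far_from_mean(values, k):
    """Fraction of the list that is k SDs or further from the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(abs(v - mean) >= k * sd for v in values) / n


data = [0, 1, 1, 2, 3, 5, 8, 40]   # hypothetical list with no negative entries
mean = sum(data) / len(data)       # 7.5

# Markov: fraction of entries >= x is at most mean / x.
markov_holds = all(
    fraction_at_least(data, x) <= mean / x for x in (5, 10, 20, 40)
)

# Chebychev: fraction of entries k SDs or further from the mean is at most 1/k^2.
chebychev_holds = all(
    fraction_far_from_mean(data, k) <= 1 / k**2 for k in (1, 1.5, 2, 3)
)
```

Both flags come out `True` for this list, as the inequalities require for any list that meets their conditions.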
Key Terms

affine transformation
arithmetic mean
average
categorical
Chebychev's inequality
class interval
deviation
discrete
histogram
independent
interquartile range (IQR)
lower quartile
Markov's inequality
mean
measures of location
median
mode
monotonic function
percentile
qualitative
quartile
range
resistant
RMS
skewed
spread
standard deviation (SD)
statistics
symmetric
upper quartile
variability