MEASURES OF DISPERSION (VARIABILITY)

POLI 300 Handout #7
N. R. Miller

While measures of central tendency indicate what value of a variable is (in one sense or another, e.g., mode, median, mean) average or central or typical in a set of data, measures of dispersion (or variability or spread) indicate (in one sense or another) the extent to which the observed values are spread out around that center: how far apart observed values typically are from each other or from some average value (in particular, the mean). Thus:

(a) if all cases have identical observed values (and thereby also all have the average value), dispersion is zero;
(b) if most cases have observed values that are quite close together (thereby also quite close to the average value), dispersion is low (but greater than zero); but
(c) if many cases have observed values that are quite far apart from many others (or from the average value), dispersion is high.

A measure of dispersion provides a summary statistic that indicates the magnitude of such dispersion and, like a measure of central tendency, is a univariate statistic.

Because dispersion is concerned with how close together or far apart observed values are (i.e., with the magnitude of the intervals between them), it should be apparent that the notion of dispersion makes sense, and measures of dispersion are defined, only for interval (or ratio) variables. (There is one exception: a very crude measure of dispersion called the variation ratio, which is defined for ordinal and even nominal variables. It will be discussed briefly in the Answers & Discussion to PS #7.)

There are two principal types of measures of dispersion: range measures and deviation measures.

Range Measures

Range measures are based on the distance between (relatively) extreme values observed in the data and are conceptually connected with the median as a measure of central tendency. (See the data illustrating Percentiles, the Median, and Ranges on the back page of Handout #6 on Measures of Central Tendency.)
The (total or simple) range is the maximum (highest) value observed in the data (the value of the case at the 100th percentile) minus the minimum (lowest) value observed in the data (the value of the case at the 0th percentile), that is, the distance or interval between the values of these two extreme cases. (Note that this may be less than the range of the possible values of the variable, since logically possible extreme values may not be observed in actual data. For example, the variable LEVEL OF TURNOUT has logically possible values ranging from 0% to 100%, but in U.S. Presidential elections over the past 60 years or so, observed turnout [as conventionally measured, i.e., as Total Vote for President divided by Voting Age Population] has ranged from a minimum of about 48% (in 1996) to a maximum of about 64% (in 1960).)

The problem with the (total or simple) range as a measure of dispersion is that it depends on the values of just two cases, cases that by definition have atypical (and perhaps extraordinarily atypical) values. In particular, the range makes no distinction between a polarized distribution in which almost all observed values are close to either the minimum or maximum values and a distribution in which almost all observed values are bunched together but there are a few extreme outliers. Also, the range is undefined for theoretical distributions that are open-ended (the technical term is asymptotic), like the normal distribution (which we will take up in the next topic) or the upper end of an income-distribution type of curve (see PS #5C).

Therefore other variants of the range measure that do not reach entirely out to the extremes of the frequency distribution are often used in place of the total range.

The interdecile range is the value of the case that stands at the 90th percentile of the distribution minus the value of the case that stands at the 10th percentile, that is, the distance or interval between the values of these two less extreme cases.

In like manner, the interquartile range is the value of the case that stands at the 75th percentile of the distribution minus the value of the case that stands at the 25th percentile. (The first quartile is the median observation among all cases that lie below the overall median, and the third quartile is the median observation among all cases that lie above the overall median. In these terms, the interquartile range is the third quartile minus the first quartile.)

We have previously used a range measure in a special context. The handout on Random Sampling said the following:

    Suppose the Gallup Poll takes a random sample of respondents and reports that the President's current approval rating is 62% and that this sample statistic has a margin of error of ± 3%. Here is what this means: if (hypothetically) Gallup were to take a great many random samples of the same size from the same population (e.g., the American VAP on a given day), the different samples would give different statistics (approval ratings), but 95% of these samples would give approval ratings within 3 percentage points of the true population parameter.
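The "great many random samples" in this passage can be imitated on a computer. The following is only a rough sketch under stated assumptions: the true approval rating is taken to be exactly 62%, each sample has 1,000 respondents (the handout does not give Gallup's sample size), and the function name and the number of simulated samples are made up for illustration.

```python
import random


def share_within_margin(true_pct=62, sample_size=1000, num_samples=2000,
                        margin_pts=3, seed=0):
    """Fraction of simulated samples whose approval rating lands within
    margin_pts percentage points of the true population percentage.

    All default values are illustrative assumptions, not Gallup's figures.
    """
    rng = random.Random(seed)          # fixed seed so reruns give the same result
    p = true_pct / 100
    hits = 0
    for _ in range(num_samples):
        # Draw one random sample and count the respondents who approve.
        approvers = sum(1 for _ in range(sample_size) if rng.random() < p)
        sample_pct = 100 * approvers / sample_size
        if abs(sample_pct - true_pct) <= margin_pts:
            hits += 1
    return hits / num_samples


coverage = share_within_margin()
print(f"{coverage:.1%} of the simulated samples fall within 3 points of 62%")
```

With these assumed settings the printed share should come out close to 95%, which is what a margin of error of ± 3% claims.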
Thus, if our data is the list of sample statistics produced by the (hypothetical) great many random samples, the margin of error specifies a range: the value of the sample statistic that stands at the 97.5th percentile minus the value of the sample statistic that stands at the 2.5th percentile (so that 95% of the sample statistics lie within the range). Specifically (and letting P be the value of the population parameter), this range is (P + 3%) − (P − 3%) = 6%, i.e., twice the margin of error.

Deviation Measures

Deviation measures are based on average deviations from some average value. (Recall the discussion of Deviations from the Average in Handout #6 on Measures of Central Tendency.) Since we are dealing with interval variables, we can calculate means, and deviation measures are typically based on the mean deviation from the mean value. Thus the usual deviation measures are conceptually connected with the mean as a measure of central tendency.

Suppose we have a variable $X$ and a set of cases numbered $1, 2, \ldots, n$. Let the observed value of the variable in each case be designated $x_1$, $x_2$, etc. Thus:

$$\text{mean of } X = \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}.$$

The deviation from the mean for a representative case $i$ is $(x_i - \bar{x})$. If almost all of these deviations are small (if almost all cases are close to the mean value), dispersion is small; but if many of these deviations are large (if many cases are much above or below the mean), dispersion is large. This suggests we could construct a measure $D$ of dispersion that would simply be the average (mean) of all the deviations:

$$D = \frac{(x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x})}{n} = \frac{\sum (x_i - \bar{x})}{n}.$$

But this will not work, because some of the deviations are positive and others are negative and, as we saw earlier (Handout #6, point (d) under Deviations from the Average), these positive and negative deviations necessarily balance out and add up to zero, i.e., for any distribution of observed values $\sum (x_i - \bar{x}) = 0$.

A practical way around this problem is simply to ignore the fact that some deviations are negative while others are positive by averaging the absolute values of the deviations (in effect, by ignoring the negative sign before each negative deviation):

$$MD = \frac{\sum |x_i - \bar{x}|}{n}.$$

This measure (called the mean deviation) tells us the average (mean) amount that the values for all cases deviate (regardless of whether they are higher or lower) from the average (mean) value. Indeed, this is an intuitive, understandable, and perfectly reasonable measure of dispersion, and it is occasionally used in research.

However, statisticians are mathematicians, and they dislike this measure because the formula is mathematically messy by virtue of being non-algebraic (in that it ignores negative signs). Therefore statisticians, and most researchers, use another, slightly different deviation measure of dispersion that is algebraic, and that makes use of the fact that the square of any (positive or negative) number (i.e., the number multiplied by itself) other than zero is itself always positive. This formula is based on finding the average of the squared deviations; since these are all non-negative, they do not balance out.
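The three averages discussed above can be compared on a small data set. This is a minimal sketch rather than part of the handout: the five data values and the function names are hypothetical, chosen so that the arithmetic is easy to check by hand.

```python
def deviations_from_mean(values):
    """Return each value's deviation (x_i - mean) from the mean of the data."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


def mean_deviation(values):
    """Average of the absolute deviations from the mean (MD)."""
    devs = deviations_from_mean(values)
    return sum(abs(d) for d in devs) / len(devs)


def mean_squared_deviation(values):
    """Average of the squared deviations from the mean."""
    devs = deviations_from_mean(values)
    return sum(d * d for d in devs) / len(devs)


data = [2, 4, 4, 6, 9]                 # hypothetical observed values; mean is 5
devs = deviations_from_mean(data)

print("deviations:", devs)
print("sum of deviations:", sum(devs))           # balances out to zero, so D is useless
print("mean deviation:", mean_deviation(data))
print("mean squared deviation:", mean_squared_deviation(data))
```

For this data the signed deviations cancel exactly, while the absolute and squared deviations both give a positive summary of spread.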
The resulting measure of dispersion is called the variance of the variable:

$$\text{Variance of } X = Var(X) = s^2 = \frac{\sum (x_i - \bar{x})^2}{n}.$$

That is, the variance is the average squared deviation from the mean. Remember from Handout #6 (point (e) under Deviations from the Average) that the average squared deviation from the mean value of X is smaller than the average squared deviation from any other value of X.

The variance is the usual measure of dispersion in statistical theory, but it has a drawback when researchers want to describe the dispersion in data in a practical way. Whatever units the original data (and its average values and its mean dispersion) are expressed in, the variance is expressed in the square of those units, and thus it doesn't make much intuitive or practical sense. This can be remedied by finding the (positive) square root of the variance (which takes us back to the original units). This measure of dispersion is called the standard deviation of the variable:

$$\text{Standard Deviation of } X = SD(X) = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}.$$

In order to interpret a standard deviation, or to make a plausible estimate of the SD of some data, it is useful to think of the mean deviation because (i) it is easier to estimate the magnitude of the mean deviation and (ii) the standard deviation has approximately the same numerical magnitude as the mean deviation. More precisely, given any distribution of data, the standard deviation is never less than the mean deviation; it is equal to the mean deviation if the data is distributed in a maximally polarized fashion; otherwise the SD is somewhat larger, typically about 20-50% larger.

Sample Estimates of Population Dispersion

Random sample statistics that are percentages or averages provide unbiased estimates of the corresponding population parameters. However, sample statistics that are dispersion measures provide estimates of population dispersion that are biased (at least slightly) downward. This is most obvious in the case of the range; it should be evident that a sample range is almost always smaller, and can never be larger, than the corresponding population range.

The sample standard deviation (or variance) is also biased slightly downward. (While the SD of a particular sample can be larger than the population SD, sample SDs are on average slightly smaller than the corresponding population SDs.) However, the sample SD can be adjusted to correct for this; the adjustment consists of dividing the sum of the squared deviations by $n - 1$, rather than by $n$. (Strictly speaking, this makes the sample variance an unbiased estimate of the population variance; the adjusted SD is still biased very slightly downward, but much less so.) Clearly this adjustment makes no practical difference unless the sample is quite small. Notice that if you apply the SD formula in the event that you have just a single observation in your sample, i.e., $n = 1$, it must give SD = 0 regardless of what the observed value is. More intuitively, you can get no sense of how much dispersion there is in a population with respect to some variable until you observe at least two cases and can see how far apart they are.
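The downward bias can be checked exactly on a tiny example, without any random sampling, by listing every possible sample. The sketch below makes several assumptions that are not in the handout: a hypothetical population of six values, samples of size 2 drawn with replacement, and a made-up `variance` helper. It works with the variance rather than the SD because the averaging comes out exactly for the variance.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor_offset=0):
    """Sum of squared deviations from the mean, divided by (n - divisor_offset).

    divisor_offset=0 gives the handout's formula (divide by n);
    divisor_offset=1 gives the adjusted formula (divide by n - 1).
    """
    n = len(values)
    mean = Fraction(sum(values), n)    # exact arithmetic, no rounding error
    return sum((x - mean) ** 2 for x in values) / (n - divisor_offset)


population = [1, 2, 3, 4, 5, 6]        # hypothetical population
pop_var = variance(population)

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_var_n = sum(variance(s) for s in samples) / len(samples)
avg_var_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

pop_range = max(population) - min(population)
max_sample_range = max(max(s) - min(s) for s in samples)

print("population variance:         ", pop_var)
print("average sample variance (n): ", avg_var_n)
print("average sample variance (n-1):", avg_var_n_minus_1)
print("largest sample range vs. population range:", max_sample_range, pop_range)
```

Averaged over all possible samples, the divide-by-n variance falls short of the population variance, the divide-by-(n − 1) variance matches it, and no sample range exceeds the population range.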
The $n - 1$ adjustment is why you will often see the formula for the variance and SD with an $n - 1$ divisor (and scientific calculators often build in this formula). However, for POLI 300 problem sets and tests, you should use the formula given in the previous section of this handout.

Dispersion in Ratio Variables

Given a ratio variable (e.g., income), the interesting dispersion question may pertain not to the interval between two observed values or between an observed value and the mean value but to the ratio between the two values. (For example, one household poverty level is defined as one half the median household income, and households with more than twice the median income are sometimes characterized as well off. The average compensation of CEOs today is about 250 times that of the average worker, whereas 50 years ago it was only about 40 times that of the average worker.) The degree of dispersion in ratio variables can naturally be referred to as the degree of inequality.

One ratio measure of dispersion/inequality is the coefficient of variation, which is simply the standard deviation divided by the mean. Another is the Gini Index of Inequality, which is based on a comparison between the actual cumulative distribution when cases are rank ordered from lowest to highest value (e.g., from poorest to richest) and the cumulative distribution that would exist if all cases had the same value.

How to Compute a Standard Deviation

The formula for the standard deviation is:

$$SD(X) = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}.$$

Here is how to use the formula.

1. Set up a worksheet with three columns (values, deviations from the mean, and squared deviations).
2. In the first column, list the values of the variable X for each of the n cases. (This is the raw data.)
3. Find the mean value of the variable in the data, by adding up the values in each case and dividing by the number of cases.
4. In the second column, subtract the mean from each value to get, for each case, the deviation from the mean. Some deviations are positive, others negative, and (apart from rounding error) they must add up to zero; add them up as an arithmetic check.
5. In the third column, square each deviation from the mean, i.e., multiply the deviation by itself. Since the product of two negative numbers is positive, every squared deviation is non-negative, i.e., either positive or zero (in the event a case has a value that coincides with the mean value).
6. Add up the squared deviations over all cases.
7. Divide the sum of the squared deviations by the number of cases; this gives the average squared deviation from the mean, commonly called the variance.
8. The standard deviation is the (positive) square root of the variance. (The square root of x is that number which when multiplied by itself gives x.)
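The steps above can also be followed in a few lines of code. This is an illustrative sketch, not part of the handout: the function name `sd_worksheet` and the eight data values are hypothetical, and the divisor is n, as in the handout's formula.

```python
import math


def sd_worksheet(values):
    """Follow the worksheet steps: values, deviations, squared deviations.

    Returns the worksheet rows, the mean, the arithmetic check (sum of the
    deviations), the variance, and the standard deviation (divisor n).
    """
    n = len(values)
    mean = sum(values) / n                               # step 3
    rows = []
    for x in values:
        deviation = x - mean                             # step 4
        rows.append((x, deviation, deviation ** 2))      # step 5
    check = sum(dev for _, dev, _ in rows)               # should be (about) zero
    sum_of_squares = sum(sq for _, _, sq in rows)        # step 6
    variance = sum_of_squares / n                        # step 7
    sd = math.sqrt(variance)                             # step 8
    return rows, mean, check, variance, sd


data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical raw data (step 2)
rows, mean, check, variance, sd = sd_worksheet(data)

print(f"{'x':>6} {'x - mean':>10} {'(x - mean)^2':>14}")
for x, dev, sq in rows:
    print(f"{x:>6} {dev:>10.2f} {sq:>14.2f}")
print(f"mean = {mean}, check = {check}, variance = {variance}, SD = {sd}")
```

For this data the mean is 5, the squared deviations add up to 32, the variance is 32 / 8 = 4, and the standard deviation is 2.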
