(*) If a lot of the data is far from the mean, then many of the $(x_j - \bar{x})^2$ terms will be quite large, so the mean of these terms will be large and the SD of the data will be large.

(*) In particular, outliers can make the SD bigger. (Outliers have an even bigger effect on the range of the data.)

(*) On the other hand, if the data is all clustered close to the mean, then all of the $(x_j - \bar{x})^2$ terms will be fairly small, so their mean will be small and the SD will be small.

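Example: Find the SD of the set $\{x_j\} = \{2, 4, 5, 8, 5, 11, 7\}$.

Step 1. Find the mean:

$$\bar{x} = \frac{2 + 4 + 5 + 8 + 5 + 11 + 7}{7} = \frac{42}{7} = 6.$$

Step 2. Find the mean of the squared deviations of the numbers from their mean:

$$\frac{(2-6)^2 + (4-6)^2 + (5-6)^2 + \cdots + (7-6)^2}{7} = \frac{52}{7}.$$

Step 3. Take the square root: $\mathrm{SD} = \sqrt{52/7} \approx 2.73$.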
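The steps of this example can be checked with a few lines of Python. This is only a minimal sketch: the helper name `population_sd` is our own choice, not something from the lecture, and it divides by n, following the definition of SD used here.

```python
import math

def population_sd(data):
    """SD as defined here: square root of the mean squared deviation (divide by n)."""
    n = len(data)
    mean = sum(data) / n                                   # Step 1: the mean
    mean_sq_dev = sum((x - mean) ** 2 for x in data) / n   # Step 2: mean squared deviation
    return math.sqrt(mean_sq_dev)                          # square root gives the SD

data = [2, 4, 5, 8, 5, 11, 7]
print(sum(data) / len(data))   # the mean, 6.0
print(population_sd(data))     # square root of 52/7, roughly 2.73
```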

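(*) (Very) useful shortcut (for calculations done by hand):

$$\frac{1}{n}\sum (x_j - \bar{x})^2 = \left(\frac{1}{n}\sum x_j^2\right) - (\bar{x})^2,$$

so

$$\mathrm{SD}_x = \sqrt{\frac{1}{n}\sum (x_j - \bar{x})^2} = \sqrt{\left(\frac{1}{n}\sum x_j^2\right) - (\bar{x})^2}.$$

Check with example: $\{x_j\} = \{2, 4, 5, 8, 5, 11, 7\}$ and $\bar{x} = 6$:

$$\left(\frac{1}{7}\sum x_j^2\right) - \bar{x}^2 = \frac{304}{7} - 36 = \frac{52}{7}.$$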
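As a sanity check on the shortcut, the sketch below computes the mean squared deviation both ways using exact fractions, so no rounding is involved. The two function names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

def mean_sq_dev_direct(data):
    """Mean of the squared deviations from the mean, computed directly."""
    n = len(data)
    mean = Fraction(sum(data), n)
    return sum((x - mean) ** 2 for x in data) / n

def mean_sq_dev_shortcut(data):
    """Same quantity via the shortcut: (mean of the squares) minus (square of the mean)."""
    n = len(data)
    mean = Fraction(sum(data), n)
    return Fraction(sum(x * x for x in data), n) - mean ** 2

data = [2, 4, 5, 8, 5, 11, 7]
print(mean_sq_dev_direct(data), mean_sq_dev_shortcut(data))  # 52/7 52/7
```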

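Very useful special case: All the numbers in the data are 0s and 1s: $m$ 1s and $n - m$ 0s ($n$ numbers in all).

$$\bar{x} = \frac{\overbrace{1 + 1 + \cdots + 1}^{m} + \overbrace{0 + 0 + \cdots + 0}^{n-m}}{n} = \frac{m}{n}$$

(I.e., the average is equal to the proportion of 1s in the data.)

$$\mathrm{SD}_x = \sqrt{\frac{\overbrace{1^2 + 1^2 + \cdots + 1^2}^{m} + \overbrace{0^2 + 0^2 + \cdots + 0^2}^{n-m}}{n} - \left(\frac{m}{n}\right)^2} = \sqrt{\frac{m}{n} - \left(\frac{m}{n}\right)^2} = \sqrt{\frac{m}{n}\left(1 - \frac{m}{n}\right)} = \sqrt{\frac{m}{n}\cdot\frac{n-m}{n}}$$

$$= \sqrt{(\text{proportion of 1s}) \cdot (\text{proportion of 0s})}$$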
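The 0/1 formula can be compared against the general definition on a small made-up case. In this sketch the values m = 3 and n = 10 are arbitrary, and both function names are our own.

```python
import math

def sd_of_zeros_and_ones(m, n):
    """SD of a list with m ones and n - m zeros, using sqrt(p * (1 - p)) with p = m / n."""
    p = m / n
    return math.sqrt(p * (1 - p))

def population_sd(data):
    """General definition: root of the mean squared deviation (divide by n)."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n)

m, n = 3, 10                       # arbitrary illustration: three 1s, seven 0s
data = [1] * m + [0] * (n - m)
print(sd_of_zeros_and_ones(m, n))  # from the special-case formula
print(population_sd(data))         # from the definition; the two agree up to rounding
```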

SD vs. SD⁺

One of the most important uses of sample statistics is to estimate the corresponding population parameters.

The mean of a representative sample is a good estimate of the mean of the population that the sample represents.

The SD of a representative sample tends to underestimate the SD of the population from which it was drawn. To correct for this, statisticians use the SD⁺ of the sample to estimate the SD of the population. If $n$ is the sample size, then

$$\mathrm{SD}^+ = \sqrt{\frac{n}{n-1}} \cdot \mathrm{SD}_{\text{sample}} = \sqrt{\frac{1}{n-1}\sum (x_j - \bar{x})^2}.$$

If the sample size is large, then there is no significant difference between SD and SD⁺, because $n/(n-1) \approx 1$ when $n$ is large.

The SD⁺ is called the sample standard deviation.
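Python's standard `statistics` module provides both quantities: `pstdev` divides by n (the SD of these notes) and `stdev` divides by n - 1 (the SD⁺). The sketch below treats the seven numbers from the earlier example as if they were a sample, which is an assumption made purely for illustration.

```python
import math
import statistics

sample = [2, 4, 5, 8, 5, 11, 7]      # assumed to be a sample, for illustration only
n = len(sample)

sd = statistics.pstdev(sample)       # divides by n
sd_plus = statistics.stdev(sample)   # divides by n - 1

print(sd, sd_plus)
print(math.sqrt(n / (n - 1)) * sd)   # matches sd_plus up to rounding
```

For a sample this small the correction factor is noticeable (sqrt(7/6) is about 1.08); for n in the hundreds it is negligible.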

How is the data clustered?

The proportion of the data that lies more than $k$ SDs from the mean is always less than $1/k^2$. This fact is known as Chebychev's inequality, and follows directly from how the standard deviation is defined.

For example, less than 1/4 = 25% of the values in any data set lie more than 2 SDs from the average value (mean). Less than 1/9 ≈ 11.11% of the data lie more than 3 SDs from the average value. Etc.

Turning this around, more than 75% of the data lie within 2 SDs of the mean, and more than 88.88% of the data lie within 3 SDs of the mean.

The estimates above are true for any set of data. On the other hand, if we know more about the data, then we can often get sharper estimates.
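The bound can be checked empirically on any list of numbers. In the sketch below the data set is made up (the earlier example plus one deliberate outlier), and `fraction_beyond` is a hypothetical helper name.

```python
import statistics

def fraction_beyond(data, k):
    """Fraction of the values lying more than k SDs from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)   # SD with division by n, as in these notes
    return sum(abs(x - mean) > k * sd for x in data) / len(data)

data = [2, 4, 5, 8, 5, 11, 7, 40]  # made-up data with one outlier
for k in (2, 3):
    # observed fraction, followed by the Chebychev bound 1/k^2
    print(k, fraction_beyond(data, k), 1 / k**2)
```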

For certain types of data sets, almost all of the data lies within two or three SDs of the average.

Example (from the book): $\bar{h} = 63.5$ inches and $\mathrm{SD}_h \approx 3$ inches...

(Statistics, Fourth Edition. Copyright 2007 W. W. Norton & Co., Inc.)

Standard units

(*) We commonly measure the distance of data to their average in terms of the standard deviation of the data set... This leads to the concept of standard units. If $x_j$ comes from a distribution with average $\bar{x}$ and standard deviation $\mathrm{SD}_x$, we convert $x_j$ to its standard units, $z_j$, by setting

$$z_j = \frac{x_j - \bar{x}}{\mathrm{SD}_x}.$$

(*) $z_j$ tells us how far $x_j$ is from $\bar{x}$ as a multiple of $\mathrm{SD}_x$.

(*) If $z_j > 0$, then $x_j$ is above average; if $z_j < 0$, then $x_j$ is below average.

(*) Standard units are pure numbers. This means that there are no units of measurement (inches, dollars, etc.) associated with standard units.

(*) The standard units value $z_j$ of a given datum $x_j$ is also called the z-score of $x_j$.

Example. Suppose that the average January temperature in Podunk is 45°F, with an SD of 2°F, while in Whoville the average January temperature is 25°F with an SD of 5°F. On January 20th, the temperature in Whoville was 16°F and in Podunk it was 38°F. Where was the temperature more unusual that day?

We can answer this by converting the temperatures on January 20th in both towns to standard units:

$$z_p = \frac{38 - 45}{2} = -3.5 \quad\text{and}\quad z_w = \frac{16 - 25}{5} = -1.8.$$

(*) Both temperatures were below average.

(*) The z-score for Podunk is more negative than the z-score for Whoville, so from a statistical point of view the temperature in Podunk was more unusual that day.

(*) The larger $|z_j|$, the more unusual $x_j$ is.
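The conversion in this example amounts to one line of arithmetic per town. A minimal sketch, where `standard_units` is our own helper name:

```python
def standard_units(x, average, sd):
    """Convert a value to standard units: its distance from the average, in SDs."""
    return (x - average) / sd

z_podunk = standard_units(38, 45, 2)     # -3.5
z_whoville = standard_units(16, 25, 5)   # -1.8

# The temperature with the larger absolute z-score is the more unusual one.
more_unusual = "Podunk" if abs(z_podunk) > abs(z_whoville) else "Whoville"
print(z_podunk, z_whoville, more_unusual)
```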

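Observation. Converting any set of data, $\{x_1, x_2, \ldots, x_n\}$ with average $\bar{x}$ and standard deviation $\mathrm{SD}_x = s$, to standard units produces a set of numbers $\{z_1, z_2, \ldots, z_n\}$ with average $\bar{z} = 0$ and standard deviation $\mathrm{SD}_z = 1$.

Because arithmetic...

$$\begin{aligned}
\bar{z} &= \frac{z_1 + z_2 + \cdots + z_n}{n} \\
&= \frac{\dfrac{x_1 - \bar{x}}{s} + \dfrac{x_2 - \bar{x}}{s} + \cdots + \dfrac{x_n - \bar{x}}{s}}{n} \\
&= \frac{\dfrac{x_1 + x_2 + \cdots + x_n}{n} - \dfrac{\overbrace{\bar{x} + \bar{x} + \cdots + \bar{x}}^{n}}{n}}{s} \\
&= \frac{\bar{x} - \bar{x}}{s} = \frac{0}{s} = 0
\end{aligned}$$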

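... and more arithmetic. Since $\bar{z} = 0$, the deviations $z_j - \bar{z}$ are just the $z_j$ themselves, so

$$\begin{aligned}
\mathrm{SD}_z &= \sqrt{\frac{z_1^2 + z_2^2 + \cdots + z_n^2}{n}} \\
&= \sqrt{\frac{\left(\frac{x_1 - \bar{x}}{s}\right)^2 + \left(\frac{x_2 - \bar{x}}{s}\right)^2 + \cdots + \left(\frac{x_n - \bar{x}}{s}\right)^2}{n}} \\
&= \sqrt{\frac{\frac{(x_1 - \bar{x})^2}{s^2} + \frac{(x_2 - \bar{x})^2}{s^2} + \cdots + \frac{(x_n - \bar{x})^2}{s^2}}{n}} \\
&= \sqrt{\frac{1}{s^2} \cdot \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}} \\
&= \frac{1}{s}\sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}} \\
&= \frac{s}{s} = 1
\end{aligned}$$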
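The observation is easy to confirm numerically. The sketch below standardizes the seven numbers from the earlier SD example; `to_standard_units` is a hypothetical helper, and the printed values are 0 and 1 only up to floating-point rounding.

```python
import statistics

def to_standard_units(data):
    """Convert every value to standard units using the data's own mean and SD."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)   # SD with division by n
    return [(x - mean) / sd for x in data]

z = to_standard_units([2, 4, 5, 8, 5, 11, 7])
print(statistics.fmean(z))    # very close to 0
print(statistics.pstdev(z))   # very close to 1
```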

The normal approximation, I

Different sets of data may be seen to have very similar distributions, once they have been converted to standard units.

Converting to standard units moves the center of the histogram (the average of the data) to 0, and scales the data as a whole so that one SD is converted to 1 unit.

In many cases, the histogram of the data, once converted to standard units, takes on a somewhat bell-shaped form: the form of the normal curve.

The normal curve is the graph of the function

$$y = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

(where $e = 2.7182818\ldots$).

[Figure: the normal curve. Horizontal axis: standard units z, from -4 to 4. Vertical axis: % per standard unit.]

The normal curve is symmetric around the line z = 0, and the total area under the curve is equal to 1 (or 100%, if you prefer).
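Both properties can be checked numerically. The sketch below evaluates the normal curve and approximates the area with a simple midpoint rule; the function names, the integration range of -8 to 8, and the step count are all our own choices for illustration.

```python
import math

def normal_curve(z):
    """Height of the normal curve at z: exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def area_under_curve(a, b, steps=10_000):
    """Midpoint-rule approximation of the area under the normal curve from a to b."""
    width = (b - a) / steps
    return width * sum(normal_curve(a + (i + 0.5) * width) for i in range(steps))

print(normal_curve(0))                       # peak height, about 0.399 (roughly 40% per standard unit)
print(normal_curve(1.3), normal_curve(-1.3)) # equal heights: symmetry around z = 0
print(area_under_curve(-8, 8))               # very close to 1
```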

Example: The distribution of heights of women age 18 and over in HANES5 (Health and Nutrition Examination Study, 03-04) appears in the histogram below (from page 81 in chapter 5 of FPP). The average height is 63.5 inches and the SD is about 3 inches. The shaded region represents the heights that fall within one SD of average.

To see how well the distribution of the height data is approximated by the normal curve, we must convert the data to standard units and sketch the histogram for the standardized (or normalized) data.

To save a lot of drawing time, we observe that the conversion to standard units is just a rescaling. This means that instead of actually converting all of the heights to their standard units and then drawing a new histogram, we can simply change the horizontal and vertical scales on the original histogram.

If the (rescaled) histogram is well-approximated by the normal curve, then the areas of regions under the histogram will be approximately equal to the areas under the normal curve for the same range of standard units.

I.e., the percentage of the data that lies within 1 SD of the average will be approximately equal to the area under the normal curve between -1 and 1; the percentage of the data lying within 2 SDs of the average will be approximately equal to the area under the normal curve between -2 and 2; and so forth.

This is useful, because the distribution of the area under the normal curve is well-understood. In particular...

[Figure: the normal curve with the region between -1 and 1 marked 68%. Horizontal axis: standard units z. Vertical axis: % per standard unit.]

(*) The area under the normal curve between -1 and 1 is about 0.68 = 68%.

[Figure: the normal curve with the region between -2 and 2 marked 95%.]

(*) The area under the normal curve between -2 and 2 is about 0.95 = 95%.

[Figure: the normal curve with the region between -3 and 3 marked 99%.]

(*) The area under the normal curve between -3 and 3 is about 0.997 = 99.7% (the figure rounds this to 99%).

Rule of thumb: If a set of data has an approximately normal distribution, then:

(*) About 68% of the data lies within one SD of average;

(*) About 95% of the data lies within two SDs of average;

(*) About 99.7% of the data lies within three SDs of average.

Remember: This rule only applies to data that is (approximately) normally distributed! Absent that condition (or assumptions about how the data is distributed) we rely on weaker (but more general) estimates (like Chebychev's inequality).

To calculate areas under the normal curve for regions other than those above (-1 to 1, -2 to 2 and -3 to 3), we use a normal table, like the one found in the back of the textbook.
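The rule-of-thumb percentages can be reproduced without a table, because the area under the normal curve between -k and k equals erf(k/√2), where erf is the error function available as `math.erf`. This is a sketch for checking the numbers, not the method the lecture uses; `area_within` is our own name.

```python
import math

def area_within(k):
    """Area under the normal curve between -k and k (as a fraction of the total)."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 2))   # 68.27, 95.45, 99.73
```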

[Table: a normal table (from Statistics, 4th ed., W. W. Norton & Co., Inc.).]

Using the normal table

(i) The table in the appendix gives the areas for symmetric regions $-z_0 \le z \le z_0$ (as percentages), where $0 \le z_0 \le 4.45$. If $z_0 \ge 4.50$, you can assume that the corresponding area is 99.9999%.

Example: Suppose that the heights of men aged 25-35 in a certain city are distributed (approximately) normally with an average of 67 inches and a standard deviation of 2.5 inches. What percentage of these men are between 65 and 69 inches tall?

a. A height of 65 inches corresponds to $\frac{65 - 67}{2.5} = -0.8$ standard units, and 69 inches corresponds to $\frac{69 - 67}{2.5} = 0.8$ standard units.

b. The percentage we want is (approximately) equal to the area under the normal curve between -0.8 and 0.8, which is equal to the table entry for $z_0 = 0.8$, which is 57.63%.

(ii) The normal curve is symmetric around z = 0, so the area under the curve between 0 and $z_0$ is equal to the area under the curve between $-z_0$ and 0, and both are equal to exactly one half the table entry for $z_0$.

[Figure: the region under the normal curve between 0 and $z_0$ has the same area as the region between $-z_0$ and 0.]

Example. What percentage of the men in the previous example are between 67 and 70 inches tall?

a. 67 inches is average, which corresponds to 0 standard units, and 70 inches corresponds to $\frac{70 - 67}{2.5} = 1.2$ standard units.

b. The percentage we want is (approximately) equal to the area under the normal curve between 0 and 1.2, which is equal to half the table entry for $z_0 = 1.2$. This is 76.99%/2 ≈ 38.5%.

(iii) If $z_0 > 0$, then the area under the normal curve to the left of $z_0$ is equal to 50% plus half the table entry for $z_0$, because...

[Figure: the region to the left of $z_0$ is the region to the left of 0 plus the region between 0 and $z_0$.]

$$\text{area to the left of } z_0 = 50\% + \tfrac{1}{2}\,\mathrm{Table}(z_0).$$

Example. What percentage of the men in the previous examples are less than six feet, two inches tall?

Six feet, two inches is 74 inches, which corresponds to $\frac{74 - 67}{2.5} = 2.8$ standard units. The table entry for 2.8 is 99.49%, so the percentage of men who are under 74 inches tall is 50% + 99.49%/2 ≈ 99.75%.

(iv) If $z_0 > 0$, then the area under the normal curve to the right of $z_0$ is equal to 50% minus half the table entry for $z_0$, because

[Figure: the region to the right of $z_0$ is the region to the right of 0 minus the region between 0 and $z_0$.]

$$\text{area to the right of } z_0 = 50\% - \tfrac{1}{2}\,\mathrm{Table}(z_0).$$

Example. What percentage of the men are taller than 68 inches?

68 inches corresponds to $\frac{68 - 67}{2.5} = 0.4$ standard units, so the percentage of men who are more than 68 inches tall is (approximately) 50% - 31.08%/2 = 34.46%.
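The four worked examples for rules (i) through (iv) can be reproduced in code by replacing the printed table with a computed stand-in. In the sketch below, `table` is a hypothetical function that returns the percent area between -z0 and z0 via `math.erf`; it is an assumption that this matches the textbook's table to the printed precision, and the variable names are ours.

```python
import math

def table(z0):
    """Stand-in for the normal table: percent area under the curve between -z0 and z0."""
    return 100 * math.erf(z0 / math.sqrt(2))

def standard_units(height, average=67, sd=2.5):
    """Convert a height in inches to standard units (average and SD from the example)."""
    return (height - average) / sd

between_65_69 = table(standard_units(69))         # rule (i):  table entry for 0.8
between_67_70 = table(standard_units(70)) / 2     # rule (ii): half the entry for 1.2
under_74 = 50 + table(standard_units(74)) / 2     # rule (iii): 50% plus half the entry for 2.8
over_68 = 50 - table(standard_units(68)) / 2      # rule (iv): 50% minus half the entry for 0.4

for label, value in [("65 to 69", between_65_69), ("67 to 70", between_67_70),
                     ("under 74", under_74), ("over 68", over_68)]:
    print(label, round(value, 2))
```

The results agree with the table-based answers to within rounding; small differences in the last digit come from the table entries themselves being rounded to two decimals.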

(*) The areas of other types of regions under the normal curve can be calculated from the table by using (i) through (iv) and the symmetry of the normal curve around 0. For example, if $0 < z_0 < z_1$, then the area under the normal curve between $z_0$ and $z_1$ is

$$\tfrac{1}{2}\,\mathrm{Table}(z_1) - \tfrac{1}{2}\,\mathrm{Table}(z_0),$$

because

[Figure: the region between $z_0$ and $z_1$ is the region between 0 and $z_1$ minus the region between 0 and $z_0$.]
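As an illustration of this last rule, the sketch below computes the area between two positive z-values. The example question (men between 68 and 70 inches, i.e. between 0.4 and 1.2 standard units) is our own, built from the heights used in the earlier examples, and `table` is again a computed stand-in for the printed table rather than the table itself.

```python
import math

def table(z0):
    """Stand-in for the normal table: percent area under the curve between -z0 and z0."""
    return 100 * math.erf(z0 / math.sqrt(2))

def area_between(z0, z1):
    """Percent area under the normal curve between z0 and z1, assuming 0 < z0 < z1."""
    return table(z1) / 2 - table(z0) / 2

# 68 inches is 0.4 standard units and 70 inches is 1.2 standard units
# (average 67 inches, SD 2.5 inches).
print(round(area_between(0.4, 1.2), 2))   # about 22.95
```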