2.3 Summary Statistics: Measures of Center and Spread

Ismor Fischer, 8//2008, Stat 541

POPULATION: the distribution of a numerical random variable $X$ (discrete or continuous).
  True center = ???
  True spread = ???
These are parameters (population characteristics): unknown, fixed numerical values, usually denoted by Greek letters, e.g., $\theta$ ("theta").

SAMPLE, size $n$.
  Measures of center: median, mode, mean
  Measures of spread: range, variance, standard deviation
These are statistics (sample characteristics): known (or computable) numerical values obtained from sample data. They serve as estimators of parameters, e.g., $\hat{\theta}$, and are usually denoted by the corresponding Roman letters.

Statistical inference uses the sample statistics to draw conclusions about the unknown population parameters.

Measures of Center

For a given numerical random variable $X$, assume that a random sample $\{x_1, x_2, \ldots, x_n\}$ has been selected and sorted from lowest to highest values, i.e., $x_1 \le x_2 \le \cdots \le x_{n-1} \le x_n$.

sample median = the numerical "middle" value, in the sense that half the data values are smaller and half are larger.
  If $n$ is odd, take the value in position # $\frac{n+1}{2}$.
  If $n$ is even, take the average of the two closest neighboring data values, left (position # $\frac{n}{2}$) and right (position # $\frac{n}{2} + 1$).

Comments:
  The sample median is robust (insensitive) with respect to the presence of outliers.
  More generally, one can also define quartiles ($Q_1$ = 25% cutoff, $Q_2$ = 50% cutoff = median, $Q_3$ = 75% cutoff), or percentiles (a.k.a. quantiles), which divide the data values into any given $p\%$ vs. $(100 - p)\%$ split. Example: SAT scores.

sample mode = the data value with the largest frequency ($f_{\max}$).

Comment: The sample mode is robust to outliers.

If present, repeated sample data values can be neatly consolidated in a frequency table, vis-à-vis the corresponding dotplot. (If a value $x_i$ is not repeated, then its $f_i = 1$.)

  k distinct data     absolute frequency     relative frequency
  values x_i          of x_i: f_i            of x_i: f(x_i) = f_i / n
  x_1                 f_1                    f(x_1)
  x_2                 f_2                    f(x_2)
  ...                 ...                    ...
  x_k                 f_k                    f(x_k)
  Total               n                      1

[Figure: dotplot of the frequencies $f_1, \ldots, f_k$ over the values $x_1, \ldots, x_k$, marking the mode (at $f_{\max}$) and the mean.]
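The odd/even rule for the median and the largest-frequency rule for the mode can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sample_median` and `sample_modes` are hypothetical, and the mode function returns every value tied for the largest frequency, which is one possible convention when the mode is not unique.

```python
from collections import Counter


def sample_median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 in 1-based counting is index n // 2.
        return xs[mid]
    # Even n: average positions n / 2 and n / 2 + 1 (1-based).
    return (xs[mid - 1] + xs[mid]) / 2


def sample_modes(values):
    """All data values that attain the largest absolute frequency f_max."""
    freq = Counter(values)
    f_max = max(freq.values())
    return sorted(x for x, f in freq.items() if f == f_max)
```

Replacing the largest value of a sample by an extreme outlier leaves both results unchanged, which is the robustness property noted in the comments.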

Example: $n = 12$ random sample values of $X$ = Body Temperature (°F):

{98.5, 98.6, 98.6, 98.6, 98.6, 98.6, 98.9, 98.9, 98.9, 99.1, 99.1, 99.2}

  x_i      f_i      f(x_i)
  98.5     1        1/12
  98.6     5        5/12
  98.9     3        3/12
  99.1     2        2/12
  99.2     1        1/12
  Total    n = 12   1

[Figure: dotplot of the frequencies 1, 5, 3, 2, 1 over a temperature axis running from 98.5 to 99.2.]

sample median = $\frac{98.6 + 98.9}{2}$ = 98.75 °F (six data values on either side)

sample mode = 98.6 °F

sample mean = $\frac{1}{12}\,[(98.5)(1) + (98.6)(5) + (98.9)(3) + (99.1)(2) + (99.2)(1)]$
  or, $= (98.5)\frac{1}{12} + (98.6)\frac{5}{12} + (98.9)\frac{3}{12} + (99.1)\frac{2}{12} + (99.2)\frac{1}{12}$ = 98.8 °F

sample mean = the weighted average of all the data values:

  $\bar{x} = \frac{1}{n} \sum_{i=1}^{k} x_i f_i$, where $f_i$ is the absolute frequency of $x_i$
  $\phantom{\bar{x}} = \sum_{i=1}^{k} x_i f(x_i)$, where $f(x_i) = \frac{f_i}{n}$ is the relative frequency of $x_i$

Comments:
  The sample mean is the center of mass, or balance point, of the data values.
  The sample mean is sensitive to outliers. One common remedy for this is the trimmed mean: compute the sample mean after deleting a predetermined number or percentage of outliers from each end of the data set, e.g., a "10% trimmed mean." It is robust to outliers by construction.
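As a quick numerical check of the weighted-average formula and the trimmed mean, here is a minimal Python sketch. It assumes the twelve body-temperature values are 98.5, 98.6 (five times), 98.9 (three times), 99.1 (twice), and 99.2 (once), which is the reading consistent with the stated mean of 98.8. The names `weighted_mean` and `trimmed_mean` are illustrative, and the trimming rule (drop `int(n * fraction)` values from each end) is one common convention, not necessarily the one intended in the notes.

```python
from collections import Counter

# Body temperature sample (deg F); last three values are an assumed reading.
temps = [98.5, 98.6, 98.6, 98.6, 98.6, 98.6,
         98.9, 98.9, 98.9, 99.1, 99.1, 99.2]


def weighted_mean(values):
    """Mean computed from the frequency table: (1/n) * sum of x_i * f_i."""
    freq = Counter(values)
    n = len(values)
    return sum(x * f for x, f in freq.items()) / n


def trimmed_mean(values, fraction):
    """Mean after dropping int(n * fraction) values from each end of the sorted data."""
    xs = sorted(values)
    k = int(len(xs) * fraction)
    if 2 * k >= len(xs):
        raise ValueError("fraction trims away all of the data")
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


mean_temp = weighted_mean(temps)           # close to 98.8
trimmed_temp = trimmed_mean(temps, 0.10)   # drops one value from each end
```

The weighted form gives the same result as summing all twelve values directly; it simply groups equal values first.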

Grouped Data

Suppose the original values had been "lumped" into categories.

Example: Recall the grouped "Memorial Union age" data set.

  Midpoint   Class      Frequency   Relative Frequency   Density
  x_i        Interval   f_i         f_i / n              (Rel Freq ÷ Class Width)
  15         [10, 20)   4           0.20                 0.02
  25         [20, 30)   8           0.40                 0.04
  45         [30, 60)   8           0.40                 0.013
  Total                 n = 20      1.00

group mean: Same formula as above, with $x_i$ = midpoint of the $i$-th class interval.

  $\bar{x}_{\text{group}} = \frac{1}{20}\,[(15)(4) + (25)(8) + (45)(8)]$ = 31.0 years

Exercise: Compare this value with the ungrouped sample mean $\bar{x}$ = 29.2 years.

group median (and other quantiles):

[Figure: density histogram of the three class rectangles with areas 0.20, 0.40, 0.40; the median $Q_2$ splits the middle rectangle into a left strip of area 0.30 and a right strip of area 0.10.]

By definition, the median $Q_2$ divides the data set into equal halves, i.e., 0.50 above and below. In this example, it must therefore lie in the class interval [20, 30), and divide the 0.40 area of the corresponding class rectangle as shown: 0.20 of the area already lies below 20, so a further 0.30 must lie to the left of $Q_2$ and the remaining 0.10 to the right. Since the 0.10 strip is ¼ of that area, it proportionally follows that $Q_2$ must lie at ¼ of the class width 30 − 20 = 10, or 2.5, from the right endpoint of 30. That is, $Q_2$ = 30 − 2.5, or $Q_2$ = 27.5 years. (Check that the ungrouped median = 25 years.)
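The group mean and the density column can be reproduced with a short Python sketch. It assumes the three classes are [10, 20), [20, 30), [30, 60) with frequencies 4, 8, 8 (the reading consistent with the relative frequencies and densities in the table); the `(interval, frequency)` layout and the function names are illustrative choices.

```python
# Grouped "Memorial Union age" data as ((a, b), frequency) pairs (assumed reading).
classes = [((10, 20), 4), ((20, 30), 8), ((30, 60), 8)]


def grouped_mean(classes):
    """Weighted mean using each class midpoint as the representative value."""
    n = sum(f for _, f in classes)
    return sum(((a + b) / 2) * f for (a, b), f in classes) / n


def densities(classes):
    """Relative frequency divided by class width, one value per class."""
    n = sum(f for _, f in classes)
    return [(f / n) / (b - a) for (a, b), f in classes]


group_mean = grouped_mean(classes)    # close to 31.0
class_densities = densities(classes)  # close to 0.02, 0.04, 0.0133
```

Each density times its class width recovers the relative frequency, so the rectangle areas of the density histogram sum to 1.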

Formal approach

[Figure: class rectangle of height Density over the interval [a, b), split at Q into a left area A and a right area B.]

First, identify which class interval $[a, b)$ contains the desired quantile $Q$ (e.g., median, quartile, etc.), and determine the respective left and right areas $A$ and $B$ into which it divides the corresponding class rectangle. Equating proportions for

  $\text{Density} = \frac{A + B}{b - a}$,

we obtain

  $\text{Density} = \frac{A}{Q - a} = \frac{B}{b - Q}$,

from which it follows that

  $Q = a + \frac{A}{\text{Density}}$  or  $Q = b - \frac{B}{\text{Density}}$  or  $Q = \frac{Ab + Ba}{A + B}$.

For example, in the grouped "Memorial Union age" data, we have $a = 20$, $b = 30$, and $A = 0.30$, $B = 0.10$. Substituting these values into any of the equivalent formulas above yields the median $Q_2 = 27.5$.

Exercise: Now that $Q_2$ is found, use the formula again to find the first and third quartiles $Q_1$ and $Q_3$, respectively.

Note also from above, we obtain the useful formulas

  $A = (Q - a) \times \text{Density}$
  $B = (b - Q) \times \text{Density}$

for calculating the areas $A$ and $B$, when a value of $Q$ is given! This can be used when finding the area between two quantiles $Q_1$ and $Q_2$. (See next page for another way.)
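The area-splitting argument can be sketched as a small Python function: walk through the classes until the cumulative area reaches the target proportion, compute the left and right areas A and B within that class, and apply Q = (Ab + Ba) / (A + B). This is a sketch under stated assumptions: the classes are sorted, contiguous, and given as hypothetical `((a, b), frequency)` pairs, the grouped age data are read as [10, 20), [20, 30), [30, 60) with frequencies 4, 8, 8, and the name `grouped_quantile` is illustrative.

```python
# Grouped "Memorial Union age" data as ((a, b), frequency) pairs (assumed reading).
classes = [((10, 20), 4), ((20, 30), 8), ((30, 60), 8)]


def grouped_quantile(classes, p):
    """Quantile Q with proportion p of the total area to its left.

    Uses Q = (A*b + B*a) / (A + B), where A and B are the areas of the
    containing class rectangle to the left and right of Q.
    """
    n = sum(f for _, f in classes)
    area_below = 0.0  # total area of the classes entirely to the left
    for (a, b), f in classes:
        rel_freq = f / n
        if rel_freq > 0 and area_below + rel_freq >= p:
            A = p - area_below              # area left of Q within this class
            B = area_below + rel_freq - p   # area right of Q within this class
            return (A * b + B * a) / (A + B)
        area_below += rel_freq
    raise ValueError("p must lie between 0 and 1")


median_age = grouped_quantile(classes, 0.5)  # close to 27.5
```

Calling the same function with `p=0.25` or `p=0.75` applies the formula to the quartiles, since only the target area changes.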

Alternative approach

  Class Interval     Frequency   Relative Frequency   Cumulative Relative Frequency
                     f_i         f_i / n              F_i = f_1/n + f_2/n + ... + f_i/n
  I_1                f_1         f_1 / n              F_1
  I_2                f_2         f_2 / n              F_2
  ...                ...         ...                  ...
  I_i                f_i         f_i / n              F_low < 0.5
                                                      (Q_2 = ? falls where F = 0.5)
  I_{i+1} = [a, b)   f_{i+1}     f_{i+1} / n          F_high > 0.5
  ...                ...         ...                  ...
  I_k                f_k         f_k / n              1
  Total              n           1

Then

  $Q_2 = a + \frac{0.5 - F_{\text{low}}}{F_{\text{high}} - F_{\text{low}}}\,(b - a)$  or  $Q_2 = b - \frac{F_{\text{high}} - 0.5}{F_{\text{high}} - F_{\text{low}}}\,(b - a)$.

Again, in the grouped "Memorial Union age" data, we have $a = 20$, $b = 30$, $F_{\text{low}} = 0.2$, and $F_{\text{high}} = 0.6$ (why?). Substituting these values into either formula yields the median $Q_2 = 27.5$.

To find $Q_1$, replace the 0.5 in the formula by 0.25; to find $Q_3$, replace the 0.5 in the formula by 0.75, etc.

Conversely, if a quantile $Q$ in $[a, b)$ is given, then we can solve the first formula for the cumulative relative frequency up to that value:

  $F(Q) = F_{\text{low}} + \frac{Q - a}{b - a}\,(F_{\text{high}} - F_{\text{low}})$.

It follows that the relative frequency (i.e., area) between two quantiles $Q_1$ and $Q_2$ is equal to the difference between their cumulative relative frequencies: $F(Q_2) - F(Q_1)$.
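Both directions of the cumulative-frequency calculation can be sketched in Python: interpolating a quantile from F_low and F_high, and recovering the cumulative relative frequency at a given value. The sketch assumes sorted, contiguous classes given as hypothetical `((a, b), frequency)` pairs, with the grouped age data read as [10, 20), [20, 30), [30, 60) and frequencies 4, 8, 8; the function names are illustrative.

```python
# Grouped "Memorial Union age" data as ((a, b), frequency) pairs (assumed reading).
classes = [((10, 20), 4), ((20, 30), 8), ((30, 60), 8)]


def quantile_from_cumulative(classes, p=0.5):
    """Q = a + (p - F_low) / (F_high - F_low) * (b - a) for the class containing Q."""
    n = sum(f for _, f in classes)
    f_low = 0.0
    for (a, b), f in classes:
        f_high = f_low + f / n
        if f > 0 and f_high >= p:
            return a + (p - f_low) / (f_high - f_low) * (b - a)
        f_low = f_high
    raise ValueError("p must lie between 0 and 1")


def cumulative_at(classes, q):
    """F(q) = F_low + (q - a) / (b - a) * (F_high - F_low), by linear interpolation."""
    n = sum(f for _, f in classes)
    f_low = 0.0
    for (a, b), f in classes:
        f_high = f_low + f / n
        if q < a:
            return f_low  # q lies to the left of this class
        if q < b:
            return f_low + (q - a) / (b - a) * (f_high - f_low)
        f_low = f_high
    return f_low  # q lies at or beyond the last right endpoint


median_age = quantile_from_cumulative(classes, 0.5)                 # close to 27.5
area_25_to_40 = cumulative_at(classes, 40) - cumulative_at(classes, 25)
```

The last line illustrates the "area between two values" idea: it is just the difference of two cumulative relative frequencies.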

Shapes of Distributions

Symmetric distributions correspond to values that are spread equally about a "center." For these, mean = median.

Examples (drawn for "smoothed histograms" of a random variable $X$): uniform, triangular, bell-shaped.

[Figure: three symmetric density curves labeled uniform, triangular, and bell-shaped.]

Note: An important special case of the bell-shaped curve is the normal distribution, a.k.a. Gaussian distribution. Example: $X$ = IQ score.

Otherwise, if more outliers of $X$ occur on one side of the median than the other, the corresponding distribution will be skewed in that direction, forming a tail.

  skewed to the left (negatively skewed): mean < median. Example: $X$ = calcium level (mg)
  skewed to the right (positively skewed): median < mean. Example: $X$ = serum cholesterol level (mg/dL)

[Figure: two skewed density curves, each with area 0.5 on either side of the median and the mean pulled toward the tail.]

Furthermore, distributions can also be classified according to the number of "peaks": unimodal, bimodal, multimodal.

[Figure: three density curves labeled unimodal, bimodal, and multimodal.]

Measures of Spread

Again assume that a numerical random sample $\{x_1, x_2, \ldots, x_n\}$ has been selected and sorted from lowest to highest values, i.e., $x_1 \le x_2 \le \cdots \le x_{n-1} \le x_n$.

sample range = $x_n - x_1$ (highest value − lowest value)

Comments:
  Uses only the two most extreme values. Very crude estimator of spread.
  The sample range is extremely sensitive to outliers. One common remedy is the interquartile range (IQR) = $Q_3 - Q_1$, which is robust to outliers by construction.

[Figure: the sorted data divided into four 25% segments by the quartiles $Q_1$, $Q_2$, $Q_3$.]

  If the original data are grouped into $k$ class intervals $[a_1, a_2), [a_2, a_3), \ldots, [a_k, a_{k+1})$, then the group range = $a_{k+1} - a_1$. A similar calculation holds for the group IQR.

Example: The "Body Temperature" data set has a sample range = 99.2 − 98.5 = 0.7 °F.

{98.5, 98.6, 98.6, 98.6, 98.6, 98.6, 98.9, 98.9, 98.9, 99.1, 99.1, 99.2}

  x_i      f_i
  98.5     1
  98.6     5
  98.9     3
  99.1     2
  99.2     1
  Total    n = 12
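The contrast between the range and the IQR can be illustrated with a short Python sketch. It assumes the body-temperature values end in 99.1, 99.1, 99.2, and it uses the standard library's `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between textbooks and software, so the IQR value depends on that choice and need not match a hand calculation done with another rule. The `with_outlier` sample is made up purely for illustration.

```python
import statistics

# Body temperature sample (deg F); last three values are an assumed reading.
temps = [98.5, 98.6, 98.6, 98.6, 98.6, 98.6,
         98.9, 98.9, 98.9, 99.1, 99.1, 99.2]


def sample_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


def interquartile_range(values):
    """Q3 - Q1, using one of several possible quartile conventions."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical contaminated sample: the largest value replaced by an outlier.
with_outlier = temps[:-1] + [109.2]

range_clean = sample_range(temps)            # close to 0.7
range_outlier = sample_range(with_outlier)   # inflated by the outlier
iqr_clean = interquartile_range(temps)
iqr_outlier = interquartile_range(with_outlier)
```

Under this convention the single outlier inflates the range by 10 degrees while leaving the IQR unchanged, because the quartiles depend only on the middle of the sorted data.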

For a much less crude measure of spread that uses all the data, first consider the following.

Definition: $x_i - \bar{x}$ = individual deviation of the $i$-th sample data value from the sample mean.

With $\bar{x} = 98.8$:

  x_i      x_i − x̄    f_i
  98.5     −0.3       1
  98.6     −0.2       5
  98.9     +0.1       3
  99.1     +0.3       2
  99.2     +0.4       1
  Total               n = 12

Naively, an estimate of the spread of the data values might be calculated as the average of these $n = 12$ individual deviations from the mean. However, this will always yield zero!

FACT: $\sum_{i=1}^{k} (x_i - \bar{x})\, f_i = 0$, i.e., the sum of the deviations is always zero.

Check: In this example, the sum = (−0.3)(1) + (−0.2)(5) + (0.1)(3) + (0.3)(2) + (0.4)(1) = 0.

Exercise: Prove this general fact algebraically.

Interpretation: The sample mean is the center of mass, or balance point, of the data values.

[Figure: dotplot of the frequencies 1, 5, 3, 2, 1 over a temperature axis from 98.5 to 99.2, balancing at the mean 98.8.]
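The zero-sum fact is easy to confirm numerically. The following minimal Python sketch assumes the body-temperature values end in 99.1, 99.1, 99.2 (the reading consistent with the deviations +0.3 and +0.4 above); because of floating-point rounding, the computed sum is only zero up to a tiny error, so it is compared against a tolerance rather than tested for exact equality.

```python
from collections import Counter

# Body temperature sample (deg F); last three values are an assumed reading.
temps = [98.5, 98.6, 98.6, 98.6, 98.6, 98.6,
         98.9, 98.9, 98.9, 99.1, 99.1, 99.2]

freq = Counter(temps)  # frequency table: value -> absolute frequency
n = len(temps)
x_bar = sum(x * f for x, f in freq.items()) / n

# Frequency-weighted sum of the individual deviations from the mean.
deviation_sum = sum((x - x_bar) * f for x, f in freq.items())

# Zero up to floating-point rounding error.
is_zero = abs(deviation_sum) < 1e-9
```

A numerical check on one data set is of course not a proof; the algebraic argument is still left as the exercise.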

Best remedy: To make them non-negative, square the deviations before summing.

sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{k} (x_i - \bar{x})^2 f_i$

  $s^2$ is not on the same scale as the data values!

sample standard deviation: $s = +\sqrt{s^2}$

  $s$ is on the same scale as the data values.

Example:

  x_i      x_i − x̄    (x_i − x̄)²    f_i
  98.5     −0.3       +0.09         1
  98.6     −0.2       +0.04         5
  98.9     +0.1       +0.01         3
  99.1     +0.3       +0.09         2
  99.2     +0.4       +0.16         1
  Total                             n = 12

Then $s^2 = \frac{1}{11}\,[(0.09)(1) + (0.04)(5) + (0.01)(3) + (0.09)(2) + (0.16)(1)] = 0.06\ (°F)^2$, so that $s = \sqrt{0.06} = 0.245$ °F. Body Temp has a small amount of variance.

Comments:
  $s^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{n - 1}$ has the important, frequently recurring form $\frac{SS}{df}$, where SS = Sum of Squares (sometimes also denoted $S_{xx}$) and df = degrees of freedom = $n - 1$, since the $n$ individual deviations have a single constraint. (Namely, their sum must equal zero.)
  The same formulas are used for grouped data, with $\bar{x}_{\text{group}}$ and $x_i$ = class interval midpoint.

Exercise: Compute $s$ for the grouped and ungrouped Memorial Union age data.

  A related measure of spread is the absolute deviation, defined as $\frac{1}{n} \sum |x_i - \bar{x}|\, f_i$, but its statistical properties are not as well-behaved as those of the standard deviation.
  Also, see Appendix > Geometric Viewpoint > Mean and Variance, for a way to understand the sum of squares formula via the Pythagorean Theorem (!), as well as a useful alternate computational formula for the sample variance.
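The SS / df form of the sample variance can be sketched directly from the frequency table. This Python sketch assumes the body-temperature values end in 99.1, 99.1, 99.2, and it takes the absolute deviation to be averaged over n (an assumed reading of the formula above). The function names are illustrative; the standard library's `statistics.variance`, which also divides by n − 1, is used only as a cross-check.

```python
import math
import statistics
from collections import Counter

# Body temperature sample (deg F); last three values are an assumed reading.
temps = [98.5, 98.6, 98.6, 98.6, 98.6, 98.6,
         98.9, 98.9, 98.9, 99.1, 99.1, 99.2]


def sample_variance(values):
    """s^2 = SS / df, with SS the weighted sum of squared deviations and df = n - 1."""
    freq = Counter(values)
    n = len(values)
    x_bar = sum(x * f for x, f in freq.items()) / n
    ss = sum((x - x_bar) ** 2 * f for x, f in freq.items())
    return ss / (n - 1)


def mean_absolute_deviation(values):
    """(1/n) * sum of |x_i - x_bar| * f_i (divisor n is an assumed convention)."""
    freq = Counter(values)
    n = len(values)
    x_bar = sum(x * f for x, f in freq.items()) / n
    return sum(abs(x - x_bar) * f for x, f in freq.items()) / n


s_squared = sample_variance(temps)   # close to 0.06
s = math.sqrt(s_squared)             # close to 0.245
library_s_squared = statistics.variance(temps)  # cross-check, same n - 1 divisor
```

For grouped data, the same functions could be applied after replacing each observation by its class midpoint, which is how the grouped part of the exercise can be approached.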