Summarizing data

Summary statistics for central location.

Sample mean (樣本平均): the average; often denoted by $\bar{X}$.

Sample median (樣本中位數): the middle number, or the average of the two middle numbers, of the sorted data. The sample median is less sensitive to extreme values in the data than the sample mean.

Example 1. Consider the sample (2, 7, 3). Find the sample mean and the sample median.

Sol. The sample mean is $(2 + 7 + 3)/3 = 4$. The sorted data are 2, 3, 7, so the middle number is 3, which is the sample median.

Software sol. To find the sample mean and median for the sample (2, 7, 3) using R, at the R prompt, enter

    x <- c(2,7,3); mean(x); median(x)

R then returns the sample mean and the sample median.

For nominal data, we use the mode to describe the central location instead of the sample mean or median.

Example 2. Students may get to school by various means. Below is a summary table of transportation tools for a class of 30 students, where the coding rule for the different tools is as follows: 1 for bus, 2 for walking, 3 for motorcycle and 4 for other. The mode for the sample is 1.

    transportation tool    1    2    3    4
    count                 10    8    7    5

Summary statistics for dispersion. Let $(X_1, \ldots, X_n)$ be a sample with sample mean $\bar{X}$.

Mean deviation: $\frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|$.

Sample variance (樣本變異數): $\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)$.
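The two forms of the sample variance (the sum of squared deviations, and the sum of squares minus $n$ times the squared mean) agree because of a short expansion. The derivation below is a sketch that uses only the definition of the sample mean, $\sum_{i=1}^{n} X_i = n\bar{X}$; nothing else is assumed.

```latex
\begin{align*}
\sum_{i=1}^{n} (X_i - \bar{X})^2
  &= \sum_{i=1}^{n} X_i^2 - 2\bar{X} \sum_{i=1}^{n} X_i + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} X_i^2 - 2n\bar{X}^2 + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 .
\end{align*}
```

Dividing both sides by $n - 1$ gives the two expressions for the sample variance.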
Sample standard deviation (樣本標準差): $\sqrt{\text{sample variance}}$.

Example 3. Consider the sample (2, 7, 3). Find the mean deviation, the sample variance and the sample standard deviation.

Sol. From Example 1, the sample mean for the sample (2, 7, 3) is 4, so the mean deviation is
$$\frac{1}{3} \left( |2 - 4| + |7 - 4| + |3 - 4| \right) = 2,$$
the sample variance is
$$\frac{1}{2} \left( (2 - 4)^2 + (7 - 4)^2 + (3 - 4)^2 \right) = \frac{1}{2} \left( 2^2 + 7^2 + 3^2 - 3 \cdot 4^2 \right) = 7,$$
and the sample standard deviation is $\sqrt{7} \approx 2.64575$.

To find the sample variance and sample standard deviation for the sample (2, 7, 3) using R, at the R prompt, enter

    x <- c(2,7,3); var(x); sd(x)

R then returns the sample variance and the sample standard deviation.

Chebyshev's Theorem. For a sample $(X_1, \ldots, X_n)$ with sample mean $\bar{X}$ and sample standard deviation $S$, and any $k > 0$,
$$\frac{\text{number of } X_i\text{'s such that } |X_i - \bar{X}| \le kS}{n} \ge 1 - \frac{1}{k^2}.$$

Example 4. Suppose that we have a sample of 1000 exam scores, where the sample mean and sample standard deviation are 75 and 2 respectively. At least what percent of the scores are between 70 and 80?

Sol. Note that $(80 - 75)/2 = 2.5$ and $(70 - 75)/2 = -2.5$, so the range $75 \pm (2.5)(2)$ is the range from 70 to 80. Take $k = 2.5$ and apply Chebyshev's Theorem: at least $1 - 1/(2.5)^2 = 84\%$ of the scores are within the range $75 \pm (2.5)(2)$, so at least 84% of the scores are between 70 and 80.

Example 5. Suppose that we have a sample of 1000 exam scores, where the sample mean and sample standard deviation are 75 and 2 respectively. Find a range that covers at least 80% of the scores.

Sol. Solving $1 - 1/k^2 = 0.8$ gives $k = \sqrt{5}$. By Chebyshev's Theorem, at least 80% of the scores are in the range from $75 - 2\sqrt{5} \approx 70.52786$ to $75 + 2\sqrt{5} \approx 79.47214$.

Histogram construction for a sample $(X_1, \ldots, X_n)$ based on Scott's rule.
1. Determine $k$: the number of classes. Choose $k$ to be the smallest number such that
$$\frac{\text{range}}{k} \le (24\sqrt{\pi})^{1/3} \frac{S}{n^{1/3}} \approx 3.5 \frac{S}{n^{1/3}},$$
where the range is the sample maximum minus the sample minimum and $S$ is the sample standard deviation.

2. Determine the class width (called class interval in the text). Let $I$ be the class width; then $I = \text{range}/k$, or $I$ can be the smallest number so that $I \ge \text{range}/k$ and $I$ is a multiple of $I_0$, where $I_0$ is chosen for convenience (usually 10 or 100).

3. Determine the class limits for each class.

Remarks.

Put approximately equal amounts of the excess in each of the two tails.

Use convenient class limits; make the lower limit of the first class a multiple of the class width if possible.

In the textbook, $k$ is chosen so that $2^k > n$, which was suggested by Sturges (1926). Scott (1979) proposed to use class width $(24\sqrt{\pi})^{1/3} \sigma / n^{1/3}$, where $\sigma$ can be estimated by the sample standard deviation $S$. The constant $(24\sqrt{\pi})^{1/3} \sigma$ in Scott's rule is chosen to minimize the integrated mean squared error for the normalized histogram as a density estimator when the sample is a random sample from a normal distribution (we will learn about density, random sample and normal distribution later).

Note that Steps 2 and 3 can be simplified by taking the class width $I = \text{range}/k$, but here we take the class width and class limits to be multiples of $I_0$ to make it easier to read the resulting frequency table.
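The three steps above can be sketched in R as follows. This is a minimal illustration and not code from the notes: the function name `scott_classes`, its argument `I0`, and the twelve-value sample are all hypothetical, and the sketch only splits the excess equally between the two tails without trying to make the class limits themselves convenient numbers.

```r
# Hypothetical helper (not from the notes): number of classes, class width
# and class limits following Steps 1-3, with the class width rounded up to
# a multiple of a convenience unit I0.
scott_classes <- function(x, I0 = 10) {
  n <- length(x)
  rng <- max(x) - min(x)
  # Scott's class width: (24*sqrt(pi))^(1/3) * S / n^(1/3), about 3.5*S/n^(1/3)
  scott_width <- (24 * sqrt(pi))^(1/3) * sd(x) * n^(-1/3)
  # Step 1: smallest k such that rng/k <= scott_width
  k <- ceiling(rng / scott_width)
  # Step 2: smallest multiple of I0 that is at least rng/k
  class_width <- I0 * ceiling(rng / (k * I0))
  # Step 3: put half of the excess k*class_width - rng in each tail
  excess <- k * class_width - rng
  lower <- min(x) - excess / 2
  list(k = k, width = class_width, limits = lower + class_width * (0:k))
}

# Made-up sample, used only to illustrate the call
x <- c(12, 15, 21, 22, 25, 28, 30, 31, 34, 38, 41, 47)
cls <- scott_classes(x, I0 = 5)
print(cls)
```

The returned `limits` vector has `k + 1` entries and can be passed as the `breaks` argument of `hist`. Rounding the width up to a multiple of `I0` makes the classes cover slightly more than the observed range, which is why the excess has to be distributed in Step 3.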
Example 6. For a sample of size 999 with minimum 5546, maximum 5925 and sample standard deviation 143.289, determine the number of classes for drawing a histogram using Scott's rule.

Sol. Choosing the smallest $k$ such that
$$\frac{5925 - 5546}{k} \le (24\sqrt{\pi})^{1/3} \cdot \frac{143.289}{(999)^{1/3}}$$
gives $k = 8$, since
$$\frac{5925 - 5546}{3.5 \times 143.289 / (999)^{1/3}} \approx 7.6.$$

Drawing a histogram using R. Suppose that the sample has been generated and stored in a vector x in R by running

    x <- qnorm(seq(0.001, 1-0.001, 0.001))*20000/6
    x <- x-min(x)+5546; x <- c(x[x<5925], 5925)

Below is the R code for drawing a histogram for x based on Scott's rule.

    # constant in Scott's rule, approximately 3.5
    c3.5 <- (24*sqrt(pi))^(1/3)
    # Scott's class width
    width <- c3.5*sd(x)*length(x)^(-1/3)
    range <- max(x)-min(x)
    # smallest k such that range/k <= width
    k <- ceiling(range/width)
    # k+1 equally spaced class limits from min(x) to max(x)
    brks <- seq(min(x), by=range/k, length.out=k+1)
    hist(x, breaks=brks)

For a histogram that shows a shape with a unique peak (the mode), we can tell from the histogram

1. the central location of the data,
2. the range for most of the data (for example the range for the middle 50% of the data), and
3. whether the shape is symmetric about the peak.

If the shape of the histogram is essentially symmetric about the peak, then the mode and the median for the binned data are approximately the same, and it is natural to use the peak location as the central location of the data. If the histogram is essentially asymmetric, then the mode and the median are not the same.

For a histogram that shows more than one peak, we can still tell from the histogram where most of the data are located.

Try to determine the central location(s) and the range for most of the data for each of the following histograms.

Left-upper histogram. Mode and median: 0. At least 50% of the data are between -1.5 and 1.5. All data are between -4 and 4.
[Figure: four histograms, each titled "Histogram of x", arranged in a 2 × 2 grid; the vertical axes show frequency.]

Right-upper histogram. Mode and median: 0. At least 50% of the data are between -0.0015 and 0.0015. All data are between -0.004 and 0.004.

Left-bottom histogram. Mode: 0.286. Median > 0.286. At least 50% of the data are between 0.2 and 0.6. All data are between 0 and 1.

Right-bottom histogram. Most data are near -2 or 2. At least 25% of the data are between -3 and -1 and at least another 25% of the data are between 1 and 3. All data are between -6 and 6.

References

[1] D. W. Scott, On optimal and data-based histograms, Biometrika, 66 (1979), pp. 605-610.
[2] H. A. Sturges, The choice of a class interval, Journal of the American Statistical Association, 21 (1926), pp. 65-66.