STA 666 Fall 2007 Web-based Course Notes
4: Describing Distributions Numerically

Numerical summaries for quantitative variables
- median and interquartile range (IQR)
- 5-number summary
- mean and standard deviation

Median and IQR

The median is the value which divides the ordered data values in half. A general formula for the position of the median is (n+1)/2. Example: n = 5 gives 3 as the position of the median (the 3rd ordered value); n = 6 gives 3.5, which means halfway between the 3rd and 4th ordered values (that is, the average of the two middle values). The median is a measure of the center of a distribution.

The interquartile range (IQR) is the difference between the upper quartile (also called the third quartile or Q3) and the lower quartile (also called the first quartile or Q1). The quartiles are the values which divide the data into quarters. The lower quartile is the 25th percentile. The upper quartile is the 75th percentile. The IQR is a measure of the spread of a distribution.

Checkpoint 1: What's another name for the second quartile?

There are several algorithms for finding the quartiles by hand. They do not all give the same result because it's not clear how the lower quartile, for example, should be defined for a variable with n = 7 cases. However, they generally give very similar answers. The one we'll use when we do computations by hand is described below.

Example: Sammy Sosa's and Barry Bonds' home run counts

Barry Bonds' home run counts:

Value:     16  19  24  25  25  33  33  34  34  37  37  40  42  45  46  46  49  73
Position:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18

The median position is (18+1)/2 = 9.5, halfway between the 9th and 10th ordered values. Hence, M = median = (34+37)/2 = 35.5.
Q1 is the median of the 9 values below the position of M; hence Q1 is the 5th value; Q1 = 25. Q3 is the median of the 9 values above the position of M; hence Q3 is the 5th value above M (or 5th from the top); Q3 = 45. IQR = Q3 - Q1 = 45 - 25; IQR = 20.

Note: the IQR is a single number; it is not the interval from 25 to 45, it's the length of this interval.

Sammy Sosa's home run counts:

Value:      4   8  10  15  25  33  36  36  40  40  49  50  63  64  66
Position:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15

Checkpoint 2: Find the median and IQR for Sosa and compare to Bonds.

Median =
Q1 =
Q3 =
IQR =

5-number summary

While the median and IQR are a useful two-number summary of center and spread, a more complete summary is the 5-number summary: minimum, Q1, median, Q3, maximum.

         min    Q1    M      Q3    max
Bonds    16     25    35.5   45    73
Sosa

The 5-number summary divides the data approximately into quarters. The most common use of the 5-number summary is as the basis for creating a boxplot (also called a box-and-whisker plot), a handy graphical tool for comparing two or more distributions.

Boxplots

A boxplot is a graphical display of a 5-number summary with one modification: points which are outliers are identified and plotted individually.

Bonds:
Upper fence = Q3 + 1.5 IQR = 45 + 1.5(20) = 75
Lower fence = Q1 - 1.5 IQR = 25 - 1.5(20) = -5

The central box in a boxplot shows Q1, the median, and Q3. The whiskers extend to the most extreme values which are within the fences (larger than -5 and smaller than 75). Any points outside the fences are considered outliers and are plotted individually. For Bonds, all the values are within the fences, so there are no outliers to be plotted individually.

Note: this is not the only definition of an outlier. Perhaps Bonds's 73 is an outlier. This is simply a reasonable definition that a computer can use.
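The by-hand procedure used above for Bonds can be sketched in Python. This is only an illustrative sketch, not part of the original notes: the function names `median`, `five_number_summary`, and `fences` are made up for this example, and for an odd number of cases the sketch assumes the median itself is left out of both halves, which is one reading of "the values below the position of M".

```python
def median(sorted_vals):
    """Median of an already-sorted list, using position (n+1)/2."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]                         # single middle value
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2  # average of two middle values


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using the by-hand quartile rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2              # count of values strictly below the median position
    lower = data[:half]        # values below the position of M
    upper = data[n - half:]    # values above the position of M
    return data[0], median(lower), median(data), median(upper), data[-1]


def fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5 IQR and Q3 + 1.5 IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 45, 46, 46, 49, 73]

mn, q1, m, q3, mx = five_number_summary(bonds)
low, high = fences(q1, q3)
outliers = [v for v in bonds if v < low or v > high]

print(mn, q1, m, q3, mx)   # 16 25 35.5 45 73
print(low, high)           # -5.0 75.0
print(outliers)            # []
```

Statistical software may use a different quartile rule, so its Q1 and Q3 can differ slightly from the by-hand values.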
Checkpoint 3: Repeat these calculations for Sosa.

Sosa:
Upper fence = Q3 + 1.5 IQR =
Lower fence = Q1 - 1.5 IQR =

[Figure: side-by-side boxplots for Sosa and Bonds; horizontal axis "Homeruns", 0 to 80]

Draw what you think Sosa's data would look like as a histogram:

Draw what you think Bonds' data would look like as a histogram:
A boxplot does not show the shape of the distribution as well as a histogram; it cannot show multiple modes, for example. Its big advantage is that it can be used to compare several distributions easily.

Note: if there are a small number of data values in each group (10 to 15 or less), then you should consider making side-by-side dotplots that show the actual values instead. And be very sure that you don't use a boxplot to summarize a data set of 5 values or less.

Mean and standard deviation

The most common numerical summary of a distribution is the mean (a measure of center) and the standard deviation (a measure of spread).

Notation: the data values are generally denoted $y_1, y_2, \ldots, y_n$. The mean is denoted by $\bar{y}$ (pronounced "y-bar"). The formula for the mean is then

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n},$$

where $\sum$ denotes summation. We often use a shorthand notation, $\bar{y} = \frac{\sum y}{n}$, which is not as precise mathematically, but as long as we understand what it means, we're OK.

The standard deviation is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}},$$

or, in shorthand, $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n-1}}$. This is roughly the average distance of the data values from the mean, which is a logical measure of spread. "Roughly" because it is actually the square root of almost the average squared distance of the data values from the mean. Taking the square root puts it back in the original units.

Why not simply take the average distance to the mean, $\frac{\sum |y_i - \bar{y}|}{n}$? This is a legitimate measure of spread, but is not commonly used because the standard deviation has some nice properties for some distributions, one of which we'll discover later.
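The two formulas above translate directly into a few lines of Python. This is a minimal sketch with illustrative function names (`mean`, `std_dev`); it uses the n - 1 divisor from the formula for s and is applied to the home run counts from the earlier example.

```python
import math


def mean(values):
    """y-bar: the sum of the values divided by n."""
    return sum(values) / len(values)


def std_dev(values):
    """s: square root of the sum of squared deviations divided by n - 1."""
    ybar = mean(values)
    squared_devs = sum((y - ybar) ** 2 for y in values)
    return math.sqrt(squared_devs / (len(values) - 1))


bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 45, 46, 46, 49, 73]
sosa = [4, 8, 10, 15, 25, 33, 36, 36, 40, 40, 49, 50, 63, 64, 66]

print(round(mean(bonds), 2), round(std_dev(bonds), 2))  # 36.56 13.21
print(round(mean(sosa), 2), round(std_dev(sosa), 2))    # 35.93 20.47
```

Python's standard library offers `statistics.mean` and `statistics.stdev`, which compute the same two quantities (`stdev` also divides by n - 1).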
name     variable    n     Mean     Std. Deviation
Sosa     hr          15    35.93    20.47
Bonds    hr          18    36.56    13.21

Resistance: A measure is said to be resistant if it is not much affected by changes in the numerical values of a small proportion of the observations (i.e., it is resistant to outliers).

Checkpoint 4: Is the median a resistant measure of center? How about the mean? Is the IQR a resistant measure of spread? How about the standard deviation?

Relationship between mean and median

Checkpoint 5: What's the relationship between the mean and median for the following distribution shapes:
- Symmetric
- Skewed to the right
- Skewed to the left

Summarizing a distribution with a measure of center and spread: which measures should you use?

Since the mean and standard deviation are not resistant, they are not appropriate for skewed distributions or distributions with outliers. They're most appropriate for symmetric distributions with no outliers.

- Symmetric distributions with no outliers: mean and standard deviation (possibly, median and IQR also)
- Skewed distributions: median and IQR
- Symmetric distributions with outliers: median and IQR, or mean and standard deviation computed with and without the outliers
Checkpoint 6: Why use the mean and standard deviation at all if the median and IQR are always appropriate?

Other measures of center and spread:

Trimmed mean: the mean computed after trimming off a percentage of the largest and smallest values. A 5% trimmed mean is the mean after trimming off the 5% of largest and 5% of smallest values. It's a useful compromise between the median (which is the 50% trimmed mean) and the mean. There is a trimmed standard deviation also.

Midrange = average of the smallest and largest values, and Range = maximum - minimum.

Checkpoint 7: Which of the above measures are resistant?
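The measures defined above can be computed as in the following sketch. The function names are illustrative, and one detail is an assumption: the notes do not say how to round when the trimming percentage does not give a whole number of values, so this sketch rounds the count down. With the 18 Bonds values, 5% is less than one value per end, so the example uses a 10% trim instead, which drops one value from each end.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end.

    Assumption: the number dropped per end is rounded down.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    data = sorted(values)
    k = int(proportion * len(data))      # values trimmed from each end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


def data_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)


bonds = [16, 19, 24, 25, 25, 33, 33, 34, 34, 37, 37, 40, 42, 45, 46, 46, 49, 73]

print(trimmed_mean(bonds, 0.10))  # 35.5625 (drops 16 and 73)
print(midrange(bonds))            # 44.5
print(data_range(bonds))          # 57
```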