Lecture 1: Simple descriptive statistics I

L1. Example: Number of egg clutches received by male sticklebacks

The three-spined stickleback, Gasterosteus aculeatus, reproduces as follows. Male sticklebacks build nests. A female is then attracted to the nest, where she lays a clutch of eggs. As soon as the eggs are laid she leaves the nest. The male enters and fertilizes the eggs. He then chases the female away and begins hunting for new mates. The data below give the number of egg clutches received by one hundred male sticklebacks.

2 5 2 0 3 2 0 4 0 7 0 3 5 2 0 5 0 4 3 4 0 3 0 2 4 0 0 3 5 4 0 0 0 2 5 3 0 0 4 0 3 0 0 5 0 5 2 4 0 5 6 3 0 0 5 0 5 2 0 7 4 0 3 3 0 4 2 5 0 0 2 0 0 2 4 2 0 5 0 2 4 3 0 3 2 0 4 2 0

It is not easy to size up these data quickly, so summarise the information in a frequency distribution as tabulated below.

Number of clutches received    0    1    2    3    4    5    6    7
Frequency                     35   11   15   12   12   12    1    2

It can be seen that a few males received many clutches, yet a third of the males received no clutches. The number of clutches with the highest frequency is zero. This is called the mode of the frequency distribution.

We can illustrate these data using a line chart, the heights of the lines being proportional to the frequencies.

[Figure: line chart of frequency against number of clutches received.]

A second sample of forty sticklebacks gave the results below.

Number of clutches received 0 2 3 4 5 Frequency 2 9 8 6 4

Suppose we want to compare the two samples. Because of the different sample sizes it makes sense to plot line graphs showing the relative frequency, or proportion, at each clutch size. For example, for clutches of size two in the second sample the relative frequency is 8/40 = 0.20, whereas for the first sample it is 15/100 = 0.15.

[Figure: line charts of relative frequency against number of clutches received, for the first sample of 100 males and the second sample of 40 males.]

What is the average number of clutches received by the sample of 100 males?

$$\text{Average number} = \frac{\text{sum of clutches}}{100 \text{ males in sample}} = \frac{2 + 5 + 2 + \cdots + 2 + 0}{100} = \frac{205}{100} = 2.05 \text{ clutches}.$$

Mathematically this can be written as follows: denote the numbers of clutches for the $n$ males in the sample by $x_1, x_2, \ldots, x_n$, and the average or sample mean by $\bar{x}$ (read as "x-bar"). Then,

$$\text{sample mean } \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The mean could also be evaluated using the frequencies presented in the frequency distribution. Here zero clutches were observed 35 times, one clutch 11 times, and so on.

$$\bar{x} = \frac{\text{sum of clutches}}{100 \text{ males in sample}} = \frac{(35 \times 0) + (11 \times 1) + \cdots + (2 \times 7)}{100} = \frac{205}{100} = 2.05 \text{ clutches}.$$

Mathematically, suppose there are $k$ distinct numbers, $x_1, x_2, \ldots, x_k$, which are observed with frequencies $f_1, f_2, \ldots, f_k$ respectively. Then,

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{n} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i.$$

Notice that the total frequency is just $\sum_i f_i = n$.

The sample mean provides a measure of location about which the sampled values are spread. For this example the values have a mean of 2.05 clutches.

An alternative measure of location is the sample median. This is the middle value if we imagine ordering the sample values from smallest to largest. Suppose we re-order the 100 values from smallest to largest. The middle pair of ordered values, the 50th and the 51st, are 2 and 2 respectively, and so the median is $\frac{1}{2}(2 + 2) = 2$ clutches.

The mean and median are examples of summary statistics. They provide summary information about the data set examined.

The stickleback data suggest that while the average number of egg clutches per male is about two, some males receive a large number of clutches, whereas a sizeable proportion of nests had no females spawning with their owners. Why are so many males unable to attract females to spawn in their nests? Why are some males very successful? Answering these questions leads zoologists on to further studies, throwing up more data for the statistician to analyse. If you want answers to these questions turn to the pages of Scientific American, April 1993.
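The frequency-table calculation above can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original notes; it assumes the first-sample frequencies 35, 11, 15, 12, 12, 12, 1, 2 for 0 to 7 clutches (they total 100 males and 205 clutches, as in the text), and the variable names are illustrative only.

```python
from statistics import median

# Frequency distribution for the first sample:
# number of clutches received -> number of males.
freq = {0: 35, 1: 11, 2: 15, 3: 12, 4: 12, 5: 12, 6: 1, 7: 2}

n = sum(freq.values())                            # total frequency, sum of f_i
mean = sum(x * f for x, f in freq.items()) / n    # (1/n) * sum of f_i * x_i
mode = max(freq, key=freq.get)                    # value with the highest frequency

# Expand the table back into an ordered sample to read off the median.
sample = [x for x, f in sorted(freq.items()) for _ in range(f)]
med = median(sample)                              # average of the 50th and 51st values

print(n, mean, mode, med)
```

Expanding the table into an ordered sample is wasteful for large data sets, but it mirrors the definition of the median given above.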

L1. Example: Book sentence length

The numbers of words in the first 35 sentences of Emily Brontë's Wuthering Heights are, respectively,

19, 6, 23, 20, 3, 45, 4, 5, 6, 42, 25, 54, 43, 34, 15, 51, 9, 20, 60, 26, 45, 48, 22, 65, 34, 35, 27, 25, 26, 32, 42, 14, 93, 35, 27.

The sample mean is

$$\bar{x} = \frac{1}{n}\sum x_i = \frac{19 + 6 + \cdots + 35 + 27}{35} = \frac{1080}{35} = 30.86 \approx 30.9 \text{ words}.$$

To find the sample median, rank the original values from smallest to largest to give:

3, 4, 5, 6, 6, 9, 14, 15, 19, 20, 20, 22, 23, 25, 25, 26, 26, 27, 27, 32, 34, 34, 35, 35, 42, 42, 43, 45, 45, 48, 51, 54, 60, 65, 93.

The sample median, the 18th value when the values are ranked in order, is 27.

One problem with the sample mean $\bar{x}$ is that it can be sensitive to outliers or extreme values in the sample. For example, removing the sentence with 93 words from the sample changes the sample mean for the remaining 34 sentences to 987/34 = 29.0 words.

What about the median? Remove the sentence with 93 words and order the remaining 34 sentences from smallest to largest. The middle pair of ordered values, the 17th and 18th, are 26 and 27 respectively, and so the median is $\frac{1}{2}(26 + 27) = 26.5$. The median is relatively robust to outliers.

Although the median is more robust to outliers, the sample mean is still the most widely used measure of location.

There are on average about thirty words per sentence, with some spread about this mean. How can we measure this spread?

A simple way to measure the spread of values would be to use the sample range, given by

Range = Maximum value − Minimum value.

For this example, the range is 93 − 3 = 90 words. Unfortunately, the range will be severely affected by outliers and is of limited use.

We still want to know how much the data values are spread or dispersed about the mean. We want a measure of dispersion about the sample mean $\bar{x}$. For any value $x_i$ the distance or deviation $(x_i - \bar{x})$ shows how much $x_i$ differs from $\bar{x}$.

[Figure: number line marking $\bar{x}$ and $x_i$, with the deviation $x_i - \bar{x}$ between them.]

It is no good using the average of $(x_i - \bar{x})$ as a measure of dispersion. It is always equal to zero. Mathematically,

$$\frac{1}{n}\sum (x_i - \bar{x}) = \frac{1}{n}\sum x_i - \frac{1}{n}\sum \bar{x} = \bar{x} - \frac{1}{n}(n\bar{x}) = 0.$$

We could measure dispersion using the mean of the absolute deviations $|x_i - \bar{x}|$,

$$\text{mean absolute deviation} = \frac{1}{n}\sum |x_i - \bar{x}|.$$

Unfortunately this is analytically difficult to use.

Consider instead the squared deviation of $x_i$ about the mean $\bar{x}$, given by $(x_i - \bar{x})^2$. The average squared deviation about the sample mean is

$$\frac{1}{n}\sum (x_i - \bar{x})^2.$$

For reasons which are discussed in a later lecture it is better to use divisor $(n - 1)$ and define the sample variance $s^2$,

$$\text{sample variance } s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2.$$

For data sets closely clustered about $\bar{x}$ the values $(x_i - \bar{x})^2$ will be small and $s^2$ will in turn be small. For data values more widely spread about $\bar{x}$ we would expect $s^2$ to be large. Notice that $s^2$, being made up of squared terms $(x_i - \bar{x})^2$, can never be negative.

It is easier in practice to evaluate $s^2$ using the formula

$$s^2 = \frac{1}{n-1}\left\{\sum x_i^2 - n\bar{x}^2\right\}.$$

This holds because

$$\sum (x_i - \bar{x})^2 = \sum (x_i^2 - 2x_i\bar{x} + \bar{x}^2) = \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 = \sum x_i^2 - 2\bar{x}(n\bar{x}) + n\bar{x}^2 = \sum x_i^2 - n\bar{x}^2,$$

giving

$$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{1}{n-1}\left\{\sum x_i^2 - n\bar{x}^2\right\}.$$

For the sentence length data $n = 35$, $\bar{x} = 1080/35 = 30.857$,

$$\sum x_i^2 = 19^2 + 6^2 + 23^2 + \cdots + 35^2 + 27^2 = 46366,$$

$$s^2 = \frac{1}{n-1}\left\{\sum x_i^2 - n\bar{x}^2\right\} = \frac{1}{34}\left\{46366 - 35\left(\tfrac{1080}{35}\right)^2\right\} = \frac{46366 - 33325.71}{34} = \frac{13040.29}{34} = 383.54 \text{ (words)}^2.$$

Since $s^2$ is measured in units of (words)$^2$ whereas $\bar{x}$ is measured in units of (words), define, as an alternative measure of dispersion, the sample standard deviation $s$, given by

$$s = +\sqrt{s^2}.$$

For this example, $s = \sqrt{383.54} = 19.58 \approx 19.6$ words.

As a comparison, in Douglas Adams's The Hitch Hiker's Guide to the Galaxy the first twenty five sentences have sample mean 5.5, median 4, sample standard deviation.
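The agreement between the defining formula for $s^2$ and the shortcut formula can be checked numerically. The Python sketch below is illustrative rather than part of the notes: it uses the 35 ranked sentence lengths (sum 1080, sum of squares 46366), and the variable names are assumptions made for this example.

```python
from math import sqrt
from statistics import median

# The 35 sentence lengths, ranked from smallest to largest.
lengths = [3, 4, 5, 6, 6, 9, 14, 15, 19, 20, 20, 22, 23, 25, 25, 26, 26, 27,
           27, 32, 34, 34, 35, 35, 42, 42, 43, 45, 45, 48, 51, 54, 60, 65, 93]

n = len(lengths)
xbar = sum(lengths) / n

# Definition: sum of squared deviations about the mean, divisor (n - 1).
s2_definition = sum((x - xbar) ** 2 for x in lengths) / (n - 1)

# Shortcut: (sum of squares - n * xbar^2) / (n - 1), using the unrounded mean.
s2_shortcut = (sum(x * x for x in lengths) - n * xbar ** 2) / (n - 1)

s = sqrt(s2_definition)

# Effect of the outlier (the 93-word sentence) on mean and median.
trimmed = lengths[:-1]
print(round(xbar, 2), median(lengths), round(s2_definition, 2), round(s, 2))
print(round(sum(trimmed) / len(trimmed), 2), median(trimmed))
```

Both variance formulas give the same value up to floating-point rounding, and dropping the largest value moves the mean by almost two words but the median by only half a word.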

L1. Example: Size of casual groups

Two researchers recently studied the frequency distribution of the size of casual groups of people at cocktail parties, at shopping centres, of children at play, and so on. One such distribution of 2423 groups, obtained on a Spring afternoon in Portland, Oregon, is given below.

Size of group      1     2     3     4     5     6
Frequency       1486   694   195    37    10     1

The sample mean is found using

$$\bar{x} = \frac{1}{n}\sum f_i x_i = \frac{(1486 \times 1) + (694 \times 2) + (195 \times 3) + \cdots}{2423} = \frac{3663}{2423} = 1.51.$$

The actual value of 3663/2423 is 1.5117623. It is sensible to round this as 1.51 or 1.5.

For this data set, with $k$ distinct values $x_1, x_2, \ldots, x_k$ observed with frequencies $f_1, f_2, \ldots, f_k$, the sample variance $s^2$ is defined using

$$s^2 = \frac{1}{n-1}\sum f_i (x_i - \bar{x})^2.$$

In practice evaluate the sample variance using the equivalent formula

$$s^2 = \frac{1}{n-1}\left\{\sum f_i x_i^2 - n\bar{x}^2\right\}.$$

Can you prove this follows from the previous definition of $s^2$?

It is often easier when evaluating $\bar{x}$ and $s^2$ by hand to display the calculations in a table.

x_i        f_i      f_i x_i    f_i x_i^2
1         1486       1486       1486
2          694       1388       2776
3          195        585       1755
4           37        148        592
5           10         50        250
6            1          6         36
Totals  n = 2423     3663       6895

$$\bar{x} = \frac{1}{n}\sum f_i x_i = \frac{3663}{2423} = 1.51176 \approx 1.51.$$

$$s^2 = \frac{1}{n-1}\left\{\sum f_i x_i^2 - n\bar{x}^2\right\} = \frac{1}{2422}\left\{6895 - 2423(1.51176)^2\right\} = 0.56045 \approx 0.560.$$

$$s = \sqrt{s^2} = \sqrt{0.56045} = 0.7486 \approx 0.75.$$

In calculating $s^2$ do not use the rounded value of $\bar{x}$. For example, using $\bar{x} \approx 1.51$ gives $s^2 = \{6895 - 2423(1.51)^2\}/2422 = 0.5658 \approx 0.566$. Using $\bar{x} \approx 1.5$ gives $s^2 = 0.596$ and using $\bar{x} \approx 2$ gives $s^2 = -1.155$! In practice it is better not to round $\bar{x}$ in calculating $s^2$, and so use $\bar{x}^2 = (3663/2423)^2$ in the formula for $s^2$.
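The sensitivity of the shortcut formula to a rounded mean is easy to demonstrate. The sketch below, in Python, is an illustration added here rather than part of the original notes; the function name `s2_using` is hypothetical, and the frequencies are those of the group-size table.

```python
# Group size -> frequency, for the 2423 casual groups.
freq = {1: 1486, 2: 694, 3: 195, 4: 37, 5: 10, 6: 1}

n = sum(freq.values())
sum_fx = sum(f * x for x, f in freq.items())        # sum of f_i * x_i
sum_fx2 = sum(f * x * x for x, f in freq.items())   # sum of f_i * x_i^2


def s2_using(xbar):
    """Shortcut formula for s^2, evaluated with whatever mean is supplied."""
    return (sum_fx2 - n * xbar ** 2) / (n - 1)


exact_mean = sum_fx / n
for xbar in (exact_mean, 1.51, 1.5, 2):
    # The further the supplied mean is from 3663/2423, the worse the answer;
    # with xbar = 2 the "variance" even comes out negative.
    print(xbar, round(s2_using(xbar), 3))
```

The shortcut formula subtracts two large, nearly equal quantities, which is why a small rounding error in the mean is magnified in the result.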

An alternative way to present the data is to show the cumulative frequencies. The cumulative frequency at any value $x$ satisfies

Cumulative frequency at $x$ = Number of observations with value $\le x$.

Size of group x                  1      2      3      4      5      6
Frequency                     1486    694    195     37     10      1
Cumulative frequency at x     1486   2180   2375   2412   2422   2423

The cumulative frequencies can be plotted on a graph. Since the size of group is a discrete quantity, only taking integer values here, the cumulative frequency plot is a step-function. This is easily seen. For example, for all values $x$ satisfying $1 \le x < 2$ the cumulative frequency is 1486.

[Figure: step-function plot of cumulative frequency against size of group.]

The cumulative frequency plot increases monotonically between 0 and $n$, the total frequency. Sometimes we plot the cumulative percentage, which increases from 0% to 100%, or the cumulative relative frequency, which increases from 0 to 1.
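Cumulative frequencies are running totals, so they can be produced directly from the frequency table. The Python sketch below is an illustrative addition under that reading; `cum_freq_at` is a hypothetical helper that evaluates the step-function at any real value of x.

```python
from itertools import accumulate

sizes = [1, 2, 3, 4, 5, 6]
freqs = [1486, 694, 195, 37, 10, 1]

cum = list(accumulate(freqs))      # running totals of the frequencies
n = cum[-1]                        # total frequency
cum_rel = [c / n for c in cum]     # cumulative relative frequency, from 0 up to 1


def cum_freq_at(x):
    """Number of observations with value <= x (a step-function in x)."""
    return sum(f for size, f in zip(sizes, freqs) if size <= x)


print(cum)
print(cum_freq_at(0.5), cum_freq_at(1), cum_freq_at(1.9), cum_freq_at(2))
```

Evaluating `cum_freq_at` between two integers returns the value at the lower integer, which is exactly the flat part of each step.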

L1. Example: Lengths of cotton yarn

The following are lengths per unit weight (hanks of 840 yd/lb) of one hundred test specimens from a batch of cotton yarn.

36.6 38.1 35.0 37.3 36.1 38.7 37.9 37.8 38.2 36.7
38.5 37.6 37.8 36.3 36.6 36.2 37.8 37.3 37.4 35.4
35.1 37.9 36.0 38.2 38.2 38.4 35.1 36.2 36.4 36.9
37.3 37.9 36.5 36.1 38.8 38.6 38.4 37.3 37.7 37.3
36.4 36.6 37.5 37.2 37.2 38.4 35.8 38.9 37.2 38.3
38.3 37.4 38.3 38.4 37.9 36.9 36.5 39.0 36.5 36.9
37.2 35.4 39.6 39.6 36.6 36.2 37.4 37.2 36.6 37.4
36.6 38.5 38.1 37.5 38.1 37.5 36.2 38.0 36.1 37.0
38.0 37.3 36.9 36.0 37.1 36.4 34.9 37.0 36.4 37.7
38.7 36.3 37.3 37.5 37.9 35.8 37.0 37.0 37.1 36.6

To construct a frequency distribution giving the frequencies for each of the values from 34.9 to 39.6 would not provide a good summary of these data. To better summarise these data we can group the observations into classes and record the number of observations in each class.

Length       34.0-34.9   35.0-35.9   36.0-36.9   37.0-37.9   38.0-38.9   39.0-39.9
Frequency        1           7          30          37          22           3

The class 39.0-39.9 has collected together all observations recorded as 39.0, 39.1, 39.2, 39.3, 39.4, 39.5, 39.6, 39.7, 39.8, 39.9. Similarly for the other classes. No observation can lie in more than one class.

The smallest and largest possible recorded values in the 39.0-39.9 class are 39.0 and 39.9 respectively. These are the class limits. They define the class.

Since the data values are recorded rounded to the nearest 0.1 unit, the class 39.0-39.9 has collected all observations with outcomes between 38.95 and 39.95. These two values are the lower and upper class boundaries.

The mid-point of the 39.0-39.9 class is 39.45. This is called the class mark. The distance between the lower and upper class boundaries is the class width, and equals one unit here.

Some detail has been lost in grouping the actual data values into classes, but we have gained a better impression of the way the data are distributed.

Notice that these data are an example of continuous data. The sample values could take any value in some interval, even though we may only record them to the nearest whole number, or here to the nearest 0.1 unit.

A histogram can be used to display these data. On each class interval erect a block whose area is proportional to the class frequency. The class boundaries 33.95, 34.95, ... can be approximated as 34, 35, ... for clarity.

[Figure: histogram of frequency per unit length against length.]

To calculate the sample mean and variance for these grouped data use the class marks as the $x_i$ values and the class frequencies as the frequencies $f_i$.

Length      Class mark x_i   Frequency f_i    f_i x_i      f_i x_i^2
34.0-34.9       34.45              1            34.45       1186.8025
35.0-35.9       35.45              7           248.15       8796.9175
36.0-36.9       36.45             30          1093.50      39858.0750
37.0-37.9       37.45             37          1385.65      51892.5925
38.0-38.9       38.45             22           845.90      32524.8550
39.0-39.9       39.45              3           118.35       4668.9075
Totals                         n = 100        3726.00     138928.1500

$$\bar{x} = \frac{1}{n}\sum f_i x_i = \frac{3726.0}{100} = 37.26 \approx 37.3 \text{ units}.$$

$$s^2 = \frac{1}{n-1}\left\{\sum f_i x_i^2 - n\bar{x}^2\right\} = \frac{1}{99}\left\{138928.15 - 100(37.26)^2\right\} = 0.98374 \approx 0.98.$$

$$s = \sqrt{s^2} = \sqrt{0.98374} = 0.9918 \approx 0.99 \text{ units}.$$

The sample standard deviation is often quoted to one more significant figure than the sample mean.

If the data had not been grouped into classes but the original 100 values used to calculate the sample mean and variance, they would have given the following results, where the sums now run over the individual values $x_i$:

$$\bar{x} = \frac{1}{n}\sum x_i = \frac{3722.9}{100} = 37.229 \approx 37.23 \text{ units}.$$

$$s^2 = \frac{1}{n-1}\left\{\sum x_i^2 - n\bar{x}^2\right\} = \frac{1}{99}\left\{138699.51 - 100(37.229)^2\right\} = 1.006726 \approx 1.007.$$

$$s = \sqrt{s^2} = \sqrt{1.006726} = 1.0033575 \approx 1.003 \text{ units}.$$

By grouping the data some fine detail has been lost, but an overall impression of the way the data behave has been gained.
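The grouped-data calculation can be sketched in code by treating every observation in a class as if it sat at the class mark. The following Python fragment is an illustrative sketch, not the notes' own method of computation; the class limits and frequencies are those of the cotton yarn table, and the names are assumptions for this example.

```python
from math import sqrt

# (lower class limit, upper class limit, frequency) for the cotton yarn data.
classes = [(34.0, 34.9, 1), (35.0, 35.9, 7), (36.0, 36.9, 30),
           (37.0, 37.9, 37), (38.0, 38.9, 22), (39.0, 39.9, 3)]

# The class mark is the mid-point of the class (34.45, 35.45, ...).
marks = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)
xbar = sum(f * x for x, f in zip(marks, freqs)) / n
s2 = (sum(f * x * x for x, f in zip(marks, freqs)) - n * xbar ** 2) / (n - 1)
s = sqrt(s2)

print(n, round(xbar, 2), round(s2, 5), round(s, 4))
```

The results differ slightly from those obtained from the 100 individual values (mean 37.229, variance 1.006726), which is the price of grouping.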

L1. Example: Failure of aircraft air conditioning equipment (NOT examined!)

The following data, reported by Proschan, summarise the intervals in service hours between failures of the air conditioning in one Boeing 720 jet aircraft.

Time between failures (hours)   0-50   50-100   100-150   150-200   200-250   250-300
Frequency                        18      7         2         0         2         1

The sample mean and median provide summaries of location. The sample variance and standard deviation provide summaries of the dispersion. Can we summarise other aspects of these data?

[Figure: histogram of frequency per 50 hour class against time between failures (hours).]

The histogram shows that the frequency distribution has a long tail on the right; it is skewed to the right. How can skewness be measured?

Suppose there are $k$ distinct values $x_1, x_2, \ldots, x_k$, which are observed with frequencies $f_1, f_2, \ldots, f_k$ respectively, so that there are $n = \sum_i f_i$ observations in total. Define

$$\text{skewness} = m_3 = \frac{1}{n}\sum f_i (x_i - \bar{x})^3.$$

The quantity $m_3$ measures the symmetry of a distribution. If a data set is symmetric about the sample mean $\bar{x}$ then there will be as many positive values of $(x_i - \bar{x})^3$ as negative values, so that cancellation occurs and $m_3 = 0$. Note that though a symmetric distribution has $m_3 = 0$, it does not necessarily follow that a data set with $m_3 = 0$ is symmetric.

In practice it is easier to evaluate $m_3$ using an alternative formula:

$$m_3 = \frac{1}{n}\sum f_i (x_i - \bar{x})^3 = \frac{1}{n}\sum f_i (x_i^3 - 3x_i^2\bar{x} + 3x_i\bar{x}^2 - \bar{x}^3)$$

$$= \left\{\frac{1}{n}\sum f_i x_i^3\right\} - 3\bar{x}\left\{\frac{1}{n}\sum f_i x_i^2\right\} + 3\bar{x}^2\left\{\frac{1}{n}\sum f_i x_i\right\} - \bar{x}^3\left\{\frac{1}{n}\sum f_i\right\}$$

$$= \left\{\frac{1}{n}\sum f_i x_i^3\right\} - 3\bar{x}\left\{\frac{1}{n}\sum f_i x_i^2\right\} + 3\bar{x}^3 - \bar{x}^3$$

$$= \left\{\frac{1}{n}\sum f_i x_i^3\right\} - 3\bar{x}\left\{\frac{1}{n}\sum f_i x_i^2\right\} + 2\bar{x}^3.$$

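For calculations done by hand, display the results in a table.

Class      Class mark x_i   Frequency f_i   f_i x_i   f_i x_i^2   f_i x_i^3
0-50            25.0             18           450       11250        281250
50-100          75.0              7           525       39375       2953125
100-150        125.0              2           250       31250       3906250
150-200        175.0              0             0           0             0
200-250        225.0              2           450      101250      22781250
250-300        275.0              1           275       75625      20796875
Totals                        n = 30         1950      258750      50718750

$$\bar{x} = \frac{1}{n}\sum f_i x_i = \frac{1950}{30} = 65 \text{ hours}.$$

$$s^2 = \frac{1}{n-1}\left\{\sum f_i x_i^2 - n\bar{x}^2\right\} = \frac{1}{29}\left\{258750 - 30(65)^2\right\} = 4551.724 \approx 4551.7 \text{ hours}^2.$$

$$s = \sqrt{s^2} = \sqrt{4551.724} = 67.4665 \approx 67.5 \text{ hours}.$$

$$\text{Skewness } m_3 = \frac{1}{n}\sum f_i (x_i - \bar{x})^3 = \left\{\frac{1}{n}\sum f_i x_i^3\right\} - 3\bar{x}\left\{\frac{1}{n}\sum f_i x_i^2\right\} + 2\bar{x}^3 = \frac{50718750}{30} - 3(65)\frac{258750}{30} + 2(65)^3 = 558000 \text{ hours}^3.$$

Since the dimension of $m_3$ depends on the units of measurement, define a coefficient of skewness $b_1$ which is a dimensionless constant:

$$\text{Coefficient of skewness } b_1 = \left\{\frac{1}{n}\sum f_i (x_i - \bar{x})^3\right\} \Big/ \left\{\frac{1}{n}\sum f_i (x_i - \bar{x})^2\right\}^{1.5}.$$

Since

$$\frac{1}{n}\sum f_i (x_i - \bar{x})^2 = \frac{(n-1)}{n}s^2,$$

this example gives

$$\frac{1}{n}\sum f_i (x_i - \bar{x})^2 = \frac{29}{30} \times 4551.724 = 4400,$$

so the coefficient of skewness is $b_1 = 558000/(4400)^{1.5} = 1.91$, indicating a positive skewness. (We refer to positive or negative skewness, not right or left skewness.)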
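The skewness calculation can be verified by computing the central moments directly from their definitions rather than from the shortcut formula. This Python sketch is an illustrative check added here; the helper `central_moment` is a hypothetical name, and the class marks and frequencies are those of the air conditioning table.

```python
marks = [25.0, 75.0, 125.0, 175.0, 225.0, 275.0]   # class marks (hours)
freqs = [18, 7, 2, 0, 2, 1]                         # class frequencies

n = sum(freqs)
xbar = sum(f * x for x, f in zip(marks, freqs)) / n


def central_moment(r):
    """r-th sample moment about the mean: (1/n) * sum of f_i * (x_i - xbar)^r."""
    return sum(f * (x - xbar) ** r for x, f in zip(marks, freqs)) / n


m2 = central_moment(2)
m3 = central_moment(3)

# Shortcut form of m3, as derived in the text.
raw2 = sum(f * x ** 2 for x, f in zip(marks, freqs)) / n
raw3 = sum(f * x ** 3 for x, f in zip(marks, freqs)) / n
m3_shortcut = raw3 - 3 * xbar * raw2 + 2 * xbar ** 3

b1 = m3 / m2 ** 1.5     # dimensionless coefficient of skewness

print(xbar, m2, m3, m3_shortcut, round(b1, 2))
```

Because the class marks are multiples of 25 and the mean is exactly 65, the direct and shortcut values of $m_3$ agree here without any rounding issues.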

Lecture 2: Simple descriptive statistics II

L2. Example: Midday world temperatures

The data below give the midday temperatures on 21st/22nd December at 81 locations around the world, to the nearest degree Celsius.

22 9 9 23 6 8 28 6 7 2 9 2 30 20 24 8 23 8 7 6 5 5 7 6 8 9 6 20 5 29 5 7 4 9 25 8 9 26 7 7 2 25 5 8 6 4 22 8 26 5 6 9 28 25 3 3 2 20 8 8 9 6 2 0 7 3

A simple frequency distribution is given below.

Temp. Freq. Temp. Freq. Temp. Freq. Temp. Freq. Temp. Freq. Temp. Freq. 25 0 5 0 5 0 5 5 5 25 3 24 0 4 0 4 0 6 6 6 26 2 23 3 0 3 0 7 3 7 3 27 0 22 0 2 0 2 2 8 0 8 9 28 2 2 0 0 0 9 5 9 3 29 20 0 0 0 0 0 0 20 3 30 9 0 9 0 2 0 2 0 3 8 0 8 0 2 0 2 3 22 2 32 0 7 7 0 3 3 23 33 0 6 0 6 4 0 4 2 24 34 0

The corresponding histogram, with classes of width one degree Celsius, is given below.

[Figure: histogram of frequency per 1 °C class against temperature (°C).]

As a summary display of the data it may be felt that this has too many class intervals to show clearly the behaviour of the data distribution. Perhaps the graphical summary might be improved by grouping some values together. Suppose temperature is grouped into classes of width ten degrees Celsius.

Temperature (°C): −24 to −15, −14 to −5, −4 to +5, +6 to +15, +16 to +25, +26 to +35
Frequency: 2 9 3 3 7

[Figure: histogram of frequency per 10 °C class against temperature (°C).]

As a summary of the data this is not too bad, but perhaps it might be felt to have over-summarised the data. The histogram above could have been plotted with the same vertical scale as before to facilitate comparison.

Now try grouping the data into classes of width three degrees Celsius.

Temp. Freq. Temp. Freq. Temp. Freq. Temp. Freq. 24 to 22 9 to 7 0 +6 to +8 4 +2 to +23 3 2 to 9 0 6 to 4 +9 to + 6 +24 to +26 6 8 to 6 3 to 5 +2 to +4 6 +27 to +29 3 5 to 3 0 0 to +2 2 +5 to +7 4 +30 to +32 2 2 to 0 0 +3 to +5 2 +8 to +20 5 +33 to +35 0

The corresponding histogram is given below.

[Figure: histogram of frequency per 3 °C class against temperature (°C).]

Perhaps this is the ideal graphical summary. Between five and fifteen intervals often provides a good display of the data. The choice of the number of intervals will depend on the total number of data values and on their distribution.

Sometimes an open-ended class may be used, such as "30 and above" or "0 and below". One way to handle such cases in plotting histograms is to assume that the open-ended class has the same width as its neighbouring class.

L2. Example: Depths of earthquakes in Fiji

R is a freely available statistical package, downloadable from the internet; see http://www.stats.bris.ac.uk/r. The third column of the data set quakes within R gives the depths in kilometres for 1000 earthquakes in the Tonga trench off Fiji having magnitude greater than 4.0. The R commands

    data(quakes)
    hist(quakes[,3])

give the following histogram.

[Figure: histogram of frequency per 50 km class against depth (km).]

The histogram above suggests the data are bimodal. One large group of earthquakes has modal class centred at 75 km and the other at 575 km.

Consider now a histogram with class width 10 km and first class starting at 0 km. Values $x$ satisfying $40 \le x < 50$ are put into the 40-50 class. This can be written as the interval $[40, 50)$, closed at the bottom and open at the top.

[Figure: histogram of frequency per 10 km class against depth (km).]

It can be seen that there is really a peak near 40 km, the minimum depth at which these earthquakes occur. The shape of a histogram is affected by both the choice of class width and the starting point of the classes.

The R command used here was:

    hist(quakes[,3], breaks=c(0:68)*10, right=FALSE)   # Gives open interval on right.

By default R gives closed intervals on the right (right=TRUE).

L2. Example: Diameters of stone circles

Between the middle of the Neolithic and the Middle Bronze Age, 3300 BC to 1500 BC, large numbers of stone circles were constructed in the British Isles. In The Stone Circles of the British Isles, Aubrey Burl lists the diameters for 180 stone circles in England. The data are given below.

Diameter (feet): 0-10, 10-20, 20-30, 30-40, 40-50, 50-60
Frequency: 0 20 2 4 8

Diameter (feet): 60-70, 70-80, 80-90, 90-100, 100-110, 110-120
Frequency: 6 4 8 5 3 8

Diameter (feet): 120-130, 130-140, 140-150, 150-200, 200-300
Frequency: 2 4 3 5 8

The class 150-200 contains all diameters $x$ satisfying $150 \le x < 200$, so $x \in [150, 200)$.

[Figure: histogram of frequency against diameter (feet) for the English circles, drawn with block height equal to class frequency.]

The histogram above is wrong! Recall that area is proportional to frequency. If the vertical scale represents frequency per 10 feet class, the five observations in the 150-200 interval are equivalent to one observation in each of the intervals 150-160, 160-170, ..., 190-200. The height of the histogram for the 150-200 interval should be 1.0. The area of this block is then height × width = 1 × 5 = 5 observations as required, where the width is five units of 10 feet. Similarly the height for the 200-300 interval should be 0.8. The correct histogram is below.

[Figure: corrected histogram of frequency per 10 feet class against diameter (feet) for the English circles.]

For 286 stone circles in Scotland the diameters are shown below.

Diameter (feet): 0-10, 10-20, 20-30, 30-40, 40-50, 50-60
Frequency: 30 46 22 3 3

Diameter (feet): 60-70, 70-80, 80-90, 90-100, 100-110, 110-120
Frequency: 33 29 7 4 2 4

Diameter (feet): 120-130, 130-140, 140-150, 150-200, 200-300
Frequency: 4 3 0 5 4

Because the total frequency for the Scottish stone circles is not the same as for the English circles, the two data sets are best compared by plotting the relative frequency in each interval. For example, for the 20-30 interval in Scotland, the relative frequency is 46/286 = 0.1608. If the vertical scale is made relative frequency per one foot class then the height of the corresponding histogram block would be 0.01608. The corresponding relative frequency for the $[20, 30)$ interval for the English circles is 20/180 = 0.1111, so that the height of the corresponding histogram block is here 0.0111. The total area under each histogram becomes unity.

[Figure: two histograms of relative frequency per one foot class against diameter (feet); left panel English circles, right panel Scottish circles.]

It can be seen that there are slightly fewer very small circles in England than in Scotland. This may be because there were originally fewer small circles in England than in Scotland. Alternatively it could be that over the years more small circles have been destroyed in England than in Scotland.
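The rule that area, not height, represents frequency can be written as a one-line calculation. The Python sketch below is an illustration added here; the function `block_height` and its parameters are hypothetical names, and only the frequencies quoted explicitly in the text (5 and 8 English circles in the two wide classes, 46 of 286 Scottish and 20 of 180 English circles in the 20-30 feet class) are used.

```python
def block_height(freq, lower, upper, unit=10.0, total=None):
    """Height of a histogram block so that area represents frequency.

    freq   -- class frequency
    lower  -- lower class boundary
    upper  -- upper class boundary
    unit   -- width that the vertical scale refers to (e.g. "per 10 feet")
    total  -- if given, use relative frequency freq / total instead of freq
    """
    value = freq if total is None else freq / total
    return value * unit / (upper - lower)


# Frequency per 10 feet class for the two wide English classes.
print(block_height(5, 150, 200))    # five observations spread over 50 feet
print(block_height(8, 200, 300))    # eight observations spread over 100 feet

# Relative frequency per one foot class for the 20-30 feet interval.
print(round(block_height(46, 20, 30, unit=1.0, total=286), 5))   # Scotland
print(round(block_height(20, 20, 30, unit=1.0, total=180), 4))   # England
```

With relative frequencies and a per-foot scale, the block areas for each country add up to one, which is what makes the two histograms directly comparable.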

L2. Example: Examination marks

Two hundred students were examined and awarded a mark out of 100. The actual marks were 51, 36, 63, 82, 62, and so on. The data were grouped as follows.

Examination marks   30-39   40-49   50-59   60-69   70-79   80-89
Frequency             12      31      64      63      25       5

The class 30-39 has collected all examination marks between 30 and 39 inclusive. Imagine a student's mark as being rounded to the nearest whole number, so that this class has boundaries 29.5 and 39.5 with class mark 34.5. Similarly for the other classes. Calculate the sample mean and variance as before by displaying the results in a table.

Class    Class mark x_i   Frequency f_i    f_i x_i     f_i x_i^2
30-39        34.5              12            414.0      14283.00
40-49        44.5              31           1379.5      61387.75
50-59        54.5              64           3488.0     190096.00
60-69        64.5              63           4063.5     262095.75
70-79        74.5              25           1862.5     138756.25
80-89        84.5               5            422.5      35701.25
Totals                      n = 200        11630.0     702320.00

$$\bar{x} = \frac{1}{n}\sum f_i x_i = \frac{11630.0}{200} = 58.15 \approx 58.2 \text{ marks}.$$

$$s^2 = \frac{1}{n-1}\left\{\sum f_i x_i^2 - n\bar{x}^2\right\} = \frac{1}{199}\left\{702320.00 - 200(58.15)^2\right\} = 130.832 \approx 130.8.$$

$$s = \sqrt{s^2} = \sqrt{130.832} = 11.438 \approx 11.44 \text{ marks}.$$

Although setting out the calculations in a table has helped us avoid making some arithmetic errors, it is still tedious to evaluate $64 \times 54.5^2$! Can we simplify calculations still further?

One way to simplify calculations is to code the data, find the mean and variance of the coded data, and then decode to give the mean and variance of the original data.

Suppose the data values are denoted by $x_1, x_2, \ldots, x_n$. Code the data using the transformation

$$z_i = \frac{x_i - m}{c},$$

where $m$ and $c$ are arbitrary constants. Typically the value of $m$ will be some value of $x$ close to the centre of the data distribution and $c$ will be the class width.

$$\text{Mean of coded values} = \bar{z} = \frac{1}{n}\sum z_i.$$

$$\text{Sample variance of coded values} = s_z^2 = \frac{1}{n-1}\sum (z_i - \bar{z})^2 = \frac{1}{n-1}\left\{\sum z_i^2 - n\bar{z}^2\right\}.$$

If $z_i = (x_i - m)/c$, then $x_i = m + cz_i$, and the sample mean of the original data is

$$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{n}\sum (m + cz_i) = \frac{1}{n}\sum m + \frac{1}{n}\sum cz_i = m + c\,\frac{1}{n}\sum z_i = m + c\bar{z}.$$

To calculate the sample variance $s_x^2$ of the original data note that $x_i = m + cz_i$ and $\bar{x} = m + c\bar{z}$, so

$$(x_i - \bar{x}) = (m + cz_i) - (m + c\bar{z}) = c(z_i - \bar{z}).$$

Then

$$s_x^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{1}{n-1}\sum \{c(z_i - \bar{z})\}^2 = c^2\,\frac{1}{n-1}\sum (z_i - \bar{z})^2 = c^2 s_z^2.$$

To calculate the sample mean and variance for the examination data using coding we can choose $m = 54.5$ and $c = 10$. Our table of calculations is given below.

Class    Class mark x_i   Frequency f_i   z_i = (x_i - 54.5)/10   f_i z_i   f_i z_i^2
30-39        34.5              12                 -2                -24         48
40-49        44.5              31                 -1                -31         31
50-59        54.5              64                  0                  0          0
60-69        64.5              63                  1                 63         63
70-79        74.5              25                  2                 50        100
80-89        84.5               5                  3                 15         45
Totals                      n = 200                                  73        287

The mean of the coded values is

$$\bar{z} = \frac{1}{n}\sum f_i z_i = \frac{73}{200} = 0.365.$$

Similarly

$$s_z^2 = \frac{1}{n-1}\left\{\sum f_i z_i^2 - n\bar{z}^2\right\} = \frac{1}{199}\left\{287 - 200(0.365)^2\right\} = 1.30832.$$

Thus

$$\text{Sample mean } \bar{x} = m + c\bar{z} = 54.5 + 10 \times 0.365 = 58.15 \approx 58.2 \text{ marks}.$$

Similarly

$$\text{Sample variance } s_x^2 = c^2 s_z^2 = 100 \times 1.30832 = 130.832 \approx 130.8.$$

Notice how we use the notation $s_x^2$ to denote the sample variance of the x-values and $s_z^2$ to denote the sample variance of the z-values. The subscripts help to emphasize the particular data we are looking at. This is of importance when we might be studying several different data sets, with labels X, Y, Z, and so on.
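The coding and decoding steps can be checked against the direct calculation. The Python sketch below is an illustrative addition; it uses the class marks and frequencies of the examination data with $m = 54.5$ and $c = 10$ as in the text, and the variable names are assumptions for this example.

```python
marks = [34.5, 44.5, 54.5, 64.5, 74.5, 84.5]   # class marks x_i
freqs = [12, 31, 64, 63, 25, 5]                 # class frequencies f_i
m, c = 54.5, 10.0                               # coding constants

n = sum(freqs)

# Code the data: z_i = (x_i - m) / c gives -2, -1, 0, 1, 2, 3.
z = [(x - m) / c for x in marks]
zbar = sum(f * zi for zi, f in zip(z, freqs)) / n
s2_z = (sum(f * zi * zi for zi, f in zip(z, freqs)) - n * zbar ** 2) / (n - 1)

# Decode: xbar = m + c * zbar and s_x^2 = c^2 * s_z^2.
xbar = m + c * zbar
s2_x = c * c * s2_z

# Direct calculation on the uncoded class marks, for comparison.
xbar_direct = sum(f * x for x, f in zip(marks, freqs)) / n
s2_direct = (sum(f * x * x for x, f in zip(marks, freqs)) - n * xbar_direct ** 2) / (n - 1)

print(zbar, round(s2_z, 5), xbar, round(s2_x, 3))
print(xbar_direct, round(s2_direct, 3))
```

On a computer the coding step saves little, but it keeps the intermediate numbers small, which is the same reason it helps with hand calculation.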

L2. Example: Price-earnings ratios of British general retail companies

The price-earnings ratios of sixty general retail companies listed on the London Stock Exchange on a certain date in September 2001 are given below.

P/E ratio    0-10   10-20   20-30   30-40   40-50   50-60
Frequency     21     32       5       0       1       1

Suppose we want to evaluate the cumulative frequencies. There are 21 observations in the first class and these must lie below the class upper boundary 10.0. Thus the cumulative frequency at 10.0 is 21. Similarly there are 32 observations in the 10-20 class. There are 53 observations less than, or equal to, 20.0, and so on.

Class    Upper boundary   Frequency   Cumulative frequency at upper boundary
0-10         10.0             21                    21
10-20        20.0             32                    53
20-30        30.0              5                    58
30-40        40.0              0                    58
40-50        50.0              1                    59
50-60        60.0              1                    60

Plot the cumulative frequencies in a cumulative frequency polygon. Construct this by joining the cumulative frequencies at each class upper boundary by straight lines. This is equivalent to drawing histograms with horizontal tops. In both cases we are assuming that data values in each class are uniformly spread throughout that class.

[Figure: left panel, histogram of frequency per 10 units against P/E ratio; right panel, cumulative frequency polygon against P/E ratio with the median M marked.]

The median $M$ is the middle value when the data are ordered. If there are $n$ observations in total, the median corresponds to a cumulative frequency of $\frac{1}{2}n$. In this example $n = 60$ so the median corresponds to a cumulative frequency of 30. From the above cumulative frequency polygon it looks as though $M \approx 12$. Can we determine $M$ more precisely?

Yes! First find the class in which the median lies, the median class, and then use linear interpolation to obtain the median. By inspection, the median class is the 10-20 class. At $x = 10$ the cumulative frequency is 21. At $x = 20$ the cumulative frequency is 53. There are 32 observations spread over the 10-20 class. By interpolation, the median equals

$$M = 10 + \{(30 - 21)/32\} \times 10 = 10 + 2.8 = 12.8.$$

[Figure: enlarged section of the cumulative frequency polygon over the 10-20 class, illustrating the interpolation $\frac{M - 10}{20 - 10} = \frac{30 - 21}{53 - 21}$.]

The median $M$ divides the data into two equal parts. Half the data values lie below the median, and half above. Define quartiles $Q_1$, $Q_2$, and $Q_3$, which divide the data into four equal parts. A quarter of the observations lie below $Q_1$; a quarter between $Q_1$ and $Q_2$; a quarter between $Q_2$ and $Q_3$; and a quarter lie above $Q_3$.

Thus $Q_1$ corresponds to a cumulative frequency of $\frac{1}{4}n = 15$ here, $Q_3$ to a cumulative frequency of $\frac{3}{4}n = 45$ here, while $Q_2 = M$. Determine $Q_1$ and $Q_3$ by interpolation in like manner as for the median.

[Figure: cumulative frequency polygon against P/E ratio with $Q_1$, $M$ and $Q_3$ marked.]

We know that the sample mean can be affected by extreme observations. The same is true of the sample standard deviation $s$. An alternative measure of dispersion might be the distance between $Q_1$ and $Q_3$. In practice use half this distance, and define the semi-interquartile range,

$$\text{Semi-interquartile range} = \tfrac{1}{2}(Q_3 - Q_1).$$

For this example

$$Q_1 = 0 + \{15/21\} \times 10 = 7.14, \qquad Q_3 = 10 + \{(45 - 21)/32\} \times 10 = 17.50,$$

$$\text{Semi-interquartile range} = \tfrac{1}{2}(Q_3 - Q_1) = \tfrac{1}{2}(17.50 - 7.14) = 5.18.$$
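The interpolation used for the median and quartiles follows the same recipe each time: find the class containing the target cumulative frequency, then move through that class in proportion. The Python sketch below is one possible way to express this; the function `grouped_quantile` is a hypothetical helper written for this illustration, and it assumes, as the text does, that values are spread uniformly within each class.

```python
def grouped_quantile(boundaries, freqs, p):
    """Value at cumulative frequency p * n, by linear interpolation in its class.

    boundaries -- class boundaries, one more entry than there are classes
    freqs      -- class frequencies
    p          -- proportion of the data lying below the required value
    """
    n = sum(freqs)
    target = p * n
    cum = 0                                   # cumulative frequency at `lower`
    for lower, upper, f in zip(boundaries, boundaries[1:], freqs):
        if f > 0 and cum + f >= target:
            return lower + (target - cum) / f * (upper - lower)
        cum += f
    return boundaries[-1]


# P/E ratio data: class boundaries and frequencies.
boundaries = [0, 10, 20, 30, 40, 50, 60]
freqs = [21, 32, 5, 0, 1, 1]

q1 = grouped_quantile(boundaries, freqs, 0.25)
med = grouped_quantile(boundaries, freqs, 0.50)
q3 = grouped_quantile(boundaries, freqs, 0.75)
siqr = (q3 - q1) / 2

print(round(q1, 2), round(med, 2), round(q3, 2), round(siqr, 2))
```

Because the boundaries are passed in explicitly, the same helper would also cope with classes of unequal width.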

L2. Example: July rainfall at Anuradhapura, Sri Lanka

The following data give the July rainfall, in inches, for a forty year period at a location in Sri Lanka.

Rainfall in inches    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
Frequency (years)    20   2   4   1   3   2   0   2   0   2   0   2   0   0   0   0   1   0   0   1

It can be seen that the July rainfall has been recorded to the nearest inch. The successive classes have boundaries 0.0-0.5, 0.5-1.5, 1.5-2.5, 2.5-3.5, 3.5-4.5, and so on. The first class has class width 0.5 inches and all other classes have class width 1.0 inches.

In drawing the histogram, recall that the area of each block is proportional to the class frequency. Suppose we make the vertical axis of the histogram frequency per class interval of one inch. Then the frequency 2 for the 0.5-1.5 class will have height two units. The area of this block will be 1.0 × 2.0 = 2. For the 0.0-0.5 class we have 20 observations in an interval of width 0.5, which is equivalent to forty values in an interval of width 1.0. We make the height of the block for the 0.0-0.5 class equal to forty units. The area of this block will be 0.5 × 40.0 = 20 as required.

[Figure: histogram of frequency per one inch class against rainfall (inches).]

What about calculating the sample mean and variance? The 0.5-1.5 class has mid-point 1.0 so the class mark is 1.0. Similarly the 1.5-2.5 class has class mark 2.0, and so on. However, the 0.0-0.5 class has mid-point 0.25, so that this value is the class mark.

Rainfall   Class mark x_i   Frequency f_i   f_i x_i   f_i x_i^2
0               0.25             20            5.0       1.25
1               1.0               2            2.0       2.0
2               2.0               4            8.0      16.0
3               3.0               1            3.0       9.0
4               4.0               3           12.0      48.0
5               5.0               2           10.0      50.0
6               6.0               0            0.0       0.0
7               7.0               2           14.0      98.0
8               8.0               0            0.0       0.0
9               9.0               2           18.0     162.0
10             10.0               0            0.0       0.0
11             11.0               2           22.0     242.0
12             12.0               0            0.0       0.0
13             13.0               0            0.0       0.0
14             14.0               0            0.0       0.0
15             15.0               0            0.0       0.0
16             16.0               1           16.0     256.0
17             17.0               0            0.0       0.0
18             18.0               0            0.0       0.0
19             19.0               1           19.0     361.0
Totals                        n = 40         129.0    1245.25

$$\bar{x} = \frac{1}{n}\sum f_i x_i = \frac{129.0}{40} = 3.225 \approx 3.2 \text{ inches}.$$

$$s^2 = \frac{1}{n-1}\left\{\sum f_i x_i^2 - n\bar{x}^2\right\} = \frac{1}{39}\left\{1245.25 - 40(3.225)^2\right\} = 21.262 \approx 21.26.$$

$$s = \sqrt{s^2} = \sqrt{21.262} = 4.611 \approx 4.6 \text{ inches}.$$

What about calculating the median and quartiles? There is no real problem here and these are found as before. The only thing to notice is that the class boundaries are successively 0.0, 0.5, 1.5, 2.5, 3.5, 4.5, and so on.

The median class is the 0.0-0.5 class, where a cumulative frequency of 20 corresponds with the upper boundary. We have here $n = 40$ observations in total, so the median, corresponding to a cumulative frequency of $\frac{1}{2}n = 20$, is given by $M = 0.5$ inches. For this example perhaps this is more representative of the quantity of rain to be expected in any July!

The lower quartile $Q_1$ lies in the 0.0-0.5 class also. There are twenty observations lying in the 0.0-0.5 class, so a cumulative frequency of $\frac{1}{4}n = 10$ will intuitively correspond with the class mid-point. Using interpolation gives, as expected, $Q_1 = 0.0 + (10/20) \times 0.5 = 0.25$. There are thirty observations less than or equal to the upper boundary of the 3.5-4.5 class, so the upper quartile is $Q_3 = 4.5$.

$$\text{Semi-interquartile range} = \tfrac{1}{2}(Q_3 - Q_1) = \tfrac{1}{2}(4.5 - 0.25) = 2.125 \approx 2.1 \text{ inches}.$$

L2. Example: Gestational ages of 1513 infants (NOT examined!)

Human gestational age is measured from the first day of a woman's last menstrual period until birth. The data below give the gestational ages in weeks for 1513 births at St. George's Hospital, London, over an eighteen month period.

Age (weeks): 22 23 24 25 26 27 28 29 30 31 32 33
Births: 0 0 6 3 6 7 7

Age (weeks): 34 35 36 37 38 39 40 41 42 43 44
Births: 7 29 43 4 222 353 40 247 53 9

We have seen several measures of location and dispersion and a measure of skewness. We can generate further summary statistics for the data.

Suppose that we have $k$ distinct values $x_1, x_2, \ldots, x_k$, which are observed with frequencies $f_1, f_2, \ldots, f_k$ respectively, so that there are $n = \sum_i f_i$ observations in total. Define, for $r = 1, 2, 3, \ldots$,

$$r\text{th sample moment about the mean} = m_r = \frac{1}{n}\sum f_i (x_i - \bar{x})^r.$$

Notice that $m_1 = 0$, $m_2 = \frac{(n-1)}{n}s^2$ so that $s^2 \approx m_2$, and $m_3$ = skewness.

Now define, for $r = 1, 2, 3, \ldots$,

$$r\text{th sample moment about the origin} = m'_r = \frac{1}{n}\sum f_i x_i^r.$$

Notice that $m'_1 = \bar{x}$, the sample mean. We use the $m'_r$ to evaluate the $m_r$ values more easily. For example,

$$m_2 = \frac{1}{n}\sum f_i (x_i - \bar{x})^2 = \left\{\frac{1}{n}\sum f_i x_i^2\right\} - \bar{x}^2 = m'_2 - (m'_1)^2,$$

$$m_3 = \frac{1}{n}\sum f_i (x_i - \bar{x})^3 = \left\{\frac{1}{n}\sum f_i x_i^3\right\} - 3\bar{x}\left\{\frac{1}{n}\sum f_i x_i^2\right\} + 2\bar{x}^3 = m'_3 - 3m'_1 m'_2 + 2(m'_1)^3.$$

We have seen that $m_2 \approx s^2$ and so $m_2$ is also a measure of dispersion. We have also seen that $m_3$ measures skewness but, because it depended upon the units of measurement, we defined a coefficient of skewness $b_1$ given by

$$\text{coefficient of skewness } b_1 = \left\{\frac{1}{n}\sum f_i (x_i - \bar{x})^3\right\} \Big/ \left\{\frac{1}{n}\sum f_i (x_i - \bar{x})^2\right\}^{1.5}.$$

We can see that $b_1 = m_3/m_2^{1.5}$.

The moment $m_4$ is sometimes called the kurtosis. Again, because $m_4$ depends upon the units of measurement, define a coefficient of kurtosis $b_2$ given by

$$\text{coefficient of kurtosis } b_2 = \frac{m_4}{m_2^2}.$$

Note that some textbooks define skewness by $b_1 = m_3/m_2^{1.5}$ and kurtosis by $b_2 = (m_4/m_2^2 - 3)$.

For these data we can derive various summary statistics.

$$\text{Sample mean } \bar{x} = \frac{1}{n}\sum f_i x_i = \frac{59071}{1513} = 39.0423 \approx 39 \text{ weeks}.$$

Treating the data as grouped about each given mid-point, we have

$$\text{Median} = 38.5 + \frac{(756.5 - 449)}{(802 - 449)} = 39.37 \approx 39.4 \text{ weeks}.$$

$$\text{Sample variance } s^2 = \frac{1}{n-1}\sum f_i (x_i - \bar{x})^2 = 4.4109 \approx 4.41 \text{ weeks}^2.$$

$$\text{Sample standard deviation } s = \sqrt{s^2} = \sqrt{4.4109} = 2.100 \approx 2.1 \text{ weeks}.$$

Other summary statistics can be obtained.

$$m_2 = \frac{1}{n}\sum f_i (x_i - \bar{x})^2 = 4.408 \text{ weeks}^2.$$

$$\text{Skewness} = m_3 = \frac{1}{n}\sum f_i (x_i - \bar{x})^3 = -21.81 \text{ weeks}^3.$$

$$\text{Kurtosis} = m_4 = \frac{1}{n}\sum f_i (x_i - \bar{x})^4 = 271.74 \text{ weeks}^4.$$

$$\text{Coefficient of skewness } b_1 = \frac{m_3}{m_2^{1.5}} = -2.357 \approx -2.36.$$

The gestational age exhibits negative skewness.

$$\text{Coefficient of kurtosis } b_2 = \frac{m_4}{m_2^2} = 13.985 \approx 14.00.$$

You met the normal distribution in MATH75. This distribution has skewness zero and kurtosis equal to three. Indeed, one way to test whether a frequency distribution comes from a normal distribution is to see whether $b_1 \approx 0$ and $b_2 \approx 3$. There is strong evidence that these data cannot be modelled using a normal distribution.

One problem with this data set is that it is not clear whether gestational age of $x$ weeks means

age between $x - \frac{1}{2}$ and $x + \frac{1}{2}$ and rounded to be $x$, or

age of $x$ weeks measured in completed weeks, so being between $x$ and $x + 1$ with mid-point $x + \frac{1}{2}$.

We have assumed the former and denoted the values by $x_i$. Suppose in fact the latter definition had been used. Denote these mid-points by $y_i$ where $y_i = x_i + \frac{1}{2}$. From what we know on coding, $\bar{y} = \bar{x} + \frac{1}{2}$, so $y_i - \bar{y} = x_i - \bar{x}$ and clearly the moments $m_r$ for the $x$ and $y$ values are the same. The variance, skewness and kurtosis of the $x$ and $y$ values are the same. These summary statistics are said to be invariant to a shift of location.

Similarly, the coefficients of skewness $b_1$ and kurtosis $b_2$ are invariant to a change of scale. For, suppose we re-scale the $x$ values using $z = x/c$. Then $m_r(x \text{ values}) = c^r m_r(z \text{ values})$, so that cancellation of the $c$ values occurs in calculating $b_1$ and $b_2$.
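The invariance properties described above can be checked numerically on any grouped data set. The Python sketch below is an illustration only: it uses the six class marks and frequencies from the aircraft air conditioning example rather than the gestational age data, and the helper names `central_moment` and `coefficients` are hypothetical.

```python
def central_moment(values, freqs, r):
    """r-th sample moment about the mean, m_r = (1/n) * sum f_i (x_i - xbar)^r."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n


def coefficients(values, freqs):
    """Return (b1, b2): coefficient of skewness and coefficient of kurtosis."""
    m2 = central_moment(values, freqs, 2)
    m3 = central_moment(values, freqs, 3)
    m4 = central_moment(values, freqs, 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2


x = [25.0, 75.0, 125.0, 175.0, 225.0, 275.0]   # stand-in data: class marks
f = [18, 7, 2, 0, 2, 1]                         # and their frequencies

shifted = [xi + 0.5 for xi in x]    # y = x + 1/2, a shift of location
scaled = [xi / 50.0 for xi in x]    # z = x / c with c = 50, a change of scale

print(coefficients(x, f))
print(coefficients(shifted, f))     # same b1 and b2: moments about the mean ignore shifts
print(coefficients(scaled, f))      # same b1 and b2: the powers of c cancel
print(central_moment(x, f, 2), central_moment(shifted, f, 2), central_moment(scaled, f, 2))
```

The last line shows that $m_2$ itself is unchanged by the shift but is divided by $c^2$ under the change of scale, in line with $m_r(x) = c^r m_r(z)$.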