Anna Janicka Mathematical Statistics 2018/2019 Lecture 1, Parts 1 & 2

Aa Jaicka Mathematical Statistics 18/19 Lecture 1, Parts 1 & 1. Descriptive Statistics By the term descriptive statistics we will mea the tools used for quatitative descriptio of the properties of a sample (a give set of iformatio or data, comig from a larger populatio). These tools are purely arithmetical (do ot use methods based o the theory of probability), ad they are aimed at summarizig or visualizig the properties of the data. The preferred tools ad measures deped o the characteristics of the variables that are to be described, ad will vary from case to case. Variables studied with the use of statistical tools may be divided ito two mai groups: measurable ad categorical. The latter group cosists of variables which take o values from a limited set of values, represetig categories (such as eye color, level of educatio, sex etc.). The first group cosists of variables which take o meaigful, umerical values (that ca be measured such as height, weight etc.). 1 Measurable variables ca be further decomposed ito cotiuous (whe the value may be a umber from a rage of real umbers if measured with ifiite precisio for example velocity) ad cout variables (whe the possible values are discrete). Typical cases of cout variables are the umber of childre or studets erolled i a class. Some variables which at first sight resemble cotiuous radom variables are also categorical for example, wages ca be measured oly up to 1 currecy uit, ad ot with ifiite precisio. Such variables, called quasi-cotiuous variables, are treated as cotiuous i all practical applicatios. 1.1. Visualizig the data. We will start our presetatio of descriptive statistical tools with cout variables, to which oe ca apply basically all tools. Assume that we look at the grades from a Mathematical Statistics course, ad that these were equal to:, 4,,,.5,, 4.5,, 5,. Oe would eed to sped some time i order to be able to say somethig geeral about the grades for this course, ad we oly have 1 studets. It would help a little bit if we arraged the umbers i order:,,,,,,.5, 4, 4.5, 5. But eve ow, i case of larger data sets it is evidet: just eumeratig the values will ot be eough to determie at first glace whether, say, it is hard or easy to pass this course. What we ca do to better comprehed the properties of the variable uder study is to group it. I the case of cout variables (ad also i the case of categorical variables), the easiest way of groupig is groupig by precise value of the variable. I the case of studet grades, we have 6 ituitive groups, correspodig to the followig possible outcomes:,,.5, 4, 4.5 ad 5. Therefore, we could summarize the course outcome with the use of the followig table: grade Number of studets Frequecy % %.5 1 1% 4 1 1% 4.5 1 1% 5 1 1% 1 Numbers are also possible descriptios of the classes of categorical variables; for example, oe could describe the possible outcomes of a coi toss heads or tails as outcome 1 ad outcome, respectively. I this case, however, the values assiged to the two categories are ot meaigful ad could be chaged without loss of our uderstadig of the pheomeo. 1

The properties of the data uder study become more apparet ow: we see that approximately % studets fail, ad that aother % of studets obtai the lowest possible passig grade. We could also visualize the proportios of the particular outcomes graphically. The most commoly used graphs i such cases are the bar chart (with couts such as the graph with red bars below or frequecies such as the graph with blue bars below) ad pie chart..5.5 1.5 1.5.5..5..15.1.5.5 4 4.5 5.5 4 4.5 5.5 4 4.5 5 Note that we could have also grouped the data differetly, for example as outcome Number of studets Frequecy fail % pass 7 7% if, say, we were oly iterested i the failure rate for the course. Note also that this latter represetatio is also a groupig for the followig series of categorical data: fail, fail, fail, pass, pass, pass, pass, pass, pass, pass. I the case of categorical data, we ca also visualize it graphically by meas of bar charts (for umbers or frequecies) ad pie charts: 8 7 6 5 4 1.8.7.6.5.4...1 fail pass fail pass fail pass Let us ow look at a differet example a cotiuous variable. Let us assume that we aalyze the surface area of 1 apartmets available for sale i a give viciity (i square meters):.45,.1, 4.6, 5.78, 7.79, 8.54, 8.91, 8.96, 9.5, 9.67, 9.8, 41.45, 41.55, 4.7, 4.4, 4.45, 44.5, 44.5, 44.7, 44.8, 44.9, 45.1, 45.9, 46.5, 47.65, 48.1, 48.55, 48.9, 49, 49.4, 49.55, 49.65, 49.7, 49.9, 5.9, 51.4, 51.5, 51.65, 51.7, 51.8, 51.98, 5, 5.1, 5., 5.65, 5.89, 5.9, 54, 54.1, 55., 55., 55.56, 55.6, 56, 56.7, 56.8, 56.9, 56.95, 57.1, 57.45, 57.7, 57.9, 58, 58.5, 58.67, 58.8, 59., 6.4, 6.7, 64., 64., 64.6, 65, 66.9, 66.78, 67.8, 68.9, 69, 69.5, 7., 76.8, 77.1, 77.8, 78.9, 79.5, 8.7, 8.4, 84.5, 84.9, 85, 86, 89.1, 89.6, 9, 96.7, 98.78, 1, 17.9, 11.7, 118.9. This time, a sesible visual aalysis of the raw series is impossible to coduct. Costructig a simple frequecy table for sigle values would ot lead to better uderstadig of the pheomeo, as there are o repeated values i the series. I this case, i order to better see the properties, we eed to group the series ito class itervals. The choice of itervals for groupig is ofte ot a easy oe. First of all, it is optimal if the itervals are meaigful: if the govermet subsidizes the acquisitio of apartmets ot exceedig 75 square meters, it would be preferable that the value of 75 be a boud to a iterval, etc. Secod, it is better if the iterval rages are of similar legth (say, 1m each). Third, i some cases it is better

(for computatioal reasos) if the frequecies i the particular classes are balaced (i.e. we should avoid groupigs such that oe class has 95 elemets ad the other has 5 elemets, etc.). We should also avoid groupigs which are too detailed or ot detailed eough. Ad, last but ot least, both for computatioal ease ad visual clarity, it is preferable that the classes have eat values. I the case of our apartmet size example, it seems reasoable to group the series ito itervals of width equal to 1, startig from. I this case, all itervals have the same legth ad all have eat ceters (or class marks). Iterval Class mark Number Frequecy Cumulative Cumulative of apartmets umber frequecy c i i f i c i cf i (,4] 5 11,11 11,11 (4,5] 45, 4,4 (5,6] 55, 67,67 (6,7] 65 1,1 79,79 (7,8] 75 6,6 85,85 (8,9] 85 8,8 9,9 (9,1] 95, 96,96 (1,11] 15, 98,98 (11,1] 115, 1 1 Total 1 1 Based o the groupig, we ca ow visualize the data with the use of a histogram (the differece betwee a histogram ad the bar chart lies i the horizotal axis the bars are adjacet to each other, ulike i the previous case, where the categories were separate). Oe ca costruct both a histogram of umbers (couts), ad frequecies. 5.5. Number 5 15 Frequecy.5..15 1.1 5.5 5 5 45 55 65 75 85 95 15 115 15 5 5 45 55 65 75 85 95 15 115 15 Note that i case of cotiuous variables it is preferable use histograms istead of simple bar charts, ad pie charts are usually ot a good choice (uless we categorize the variable). Oe ca also visualize the distributio by meas of a cumulative frequecy histogram or the empirical CDF. 1.. Characteristics of the data. I case of categorical variables, there is ot much more that we ca do i terms of descriptive statistics to visualize the data. I case of measurable variables, we have a whole array of arithmetical tools that ca be used to describe the properties of the data set. There are two basic distictios for the characteristics describig the studied variable. The first differetiatio is based o the feature of the distributio that we wat to describe whether it is the overall magitude (how large, o average, are the values of the variable), the variability, the asymmetry etc. The secod distictio is based o the values that we will be usig for descriptio whether we will be usig differet momets of the distributio (i which case we will be talkig about classical measures) or measures of positio (such as the miimum, maximum, media etc.).

1..1. Measures of cetral tedecy. Measures of cetral tedecy are those which tell us where o average is the middle of the distributio located. The basic measures are: the arithmetic mea (the average, a classical measure) or the media ad the mode (positioal measures). Other measures of positio (ot ecessarily talkig about the middle of the distributio) iclude other quatiles (such as quartiles, deciles, percetiles etc.). If X 1, X,..., X are the sample values of the variable uder study (for example, the raw data for the surface areas of apartmets), the the arithmetic mea ca be calculated as X = 1 X i. The average value of the surface area of apartmets, for the data preseted above, would be equal to X = 1 (.45 +.1 + 4.6 + + 11.7 + 118.9) = 59.58. 1 If we are dealig with grouped data (for example, like i the case of studet grades above), we ca simplify the calculatios from the above formula by avoidig summig idetical values ad usig multiplicatio istead: X = 1 k X i i, where k is the umber of groups ito which we have divided our data, X i are the values i the groups, ad i are the couts of these groups. I our example of studet grades, we could calculate the average as X = 1 ( + + 1.5 + 1 4 + 1 4.5 + 1 5) =.. 1 I both of these examples, the arithmetic mea was calculated precisely. I some cases, however, we may be faced with a situatio where we do ot have exact data at our disposal, but oly some approximatios as is the case if we are provided with data aggregated ito class itervals. I such situatios, we will ot be able to calculate the true value of the mea, but we will be able to calculate a approximatio of the average. This is achieved by treatig all observatios from a give class as beig equal to the middle of the class iterval (the socalled class mark, see the headers i the table above), ad applyig the formula for grouped data, i.e. X = 1 k c i i. I the apartmet surface area example, the approximatio of the mea, calculated based o class iterval data, would be equal to X = 1 (5 11 + 45 + 55 + 65 1 + 75 6 + 85 8 + 95 + 15 + 115 ) = 58.7. 1 This result is differet tha the exact amout, which was equal to 59.58. Obviously, the arrower the class itervals used for groupig the data, the less iformatio is lost ad the more accurate is the approximatio. If raw data are available, they should always be used for calculatig characteristics i order to avoid precisio losses. The mea is a good measure to approximate the ceter of the distributio goverig the data, provided that the distributio has a expected value (ad, as we kow from probability calculus, ot all distributios do). Also, if there are outliers (very high or very low values, or erroeous observatios) i the data, the average will be affected by these observatios. A measure of cetral tedecy which does ot have these flaws is the media, or the middle observatio: (ay) umber such that at least half of the observatios are less tha or equal to it ad at least half of the observatios are greater tha or equal to it. 4

I order to calculate the media (as well as ay other measure based o rak), we will eed to rearrage our observatios i ascedig order. We will adopt the followig otatio: X i: will be the i-th smallest value of the elemet sample (the i-th order statistic). I this otatio, X 1: is the smallest value (miimum) i the sample, ad X : is the largest value i the sample (maximum). As for the media, the calculatios will deped o whether the sample size is odd or eve. If it is odd, the there exists a sigle middle observatio; if it is eve, there exist two observatios i the middle, ad we will take a average of these two as the media value. Therefore, Med = X +1 1 ( X : + X +1: ) if is eve. : if is odd, Goig back to our examples, we ca see that: I the case of surface apartmet area raw data, we have = 1 which is eve, so the media will be the average of the 5th ad 51st observatios: Med = 1 (55. + 55.) = 55.5. I the case of grades (grouped data), we eed to fid the class with the fifth ad sixth observatios; i our case, it is goig to be the class of the grade; therefore, the Med = 1 ( + ) =. I the case of grouped class iterval data, the situatio becomes more complicated. We will ot be able to provide a specific value, but oly the rage ito which the media should fall ito (this is the iterval where the cumulative frequecy reaches.5 for the first time). If we are iterested i a sigle value, rather tha a iterval, we will provide a approximatio. I order to derive the formula, please ote the followig. If we kow how may observatios there are i the sample i the classes before the class of the media, we will kow how much additioal observatios from the media class we should take i order to reach the middle observatio. The, kowig how may observatios are actually i the media class, ad assumig observatios are uiformly spaced i the class they fall ito, we should have that the media value is proportioally as far i the class iterval as the ratio of the umber of observatios we eed to reach it to the umber of observatios we have i the class. Therefore, we will use the followig approximatio: Med = c L + b M 1 i, M where M is the umber of the class iterval of the media, c L is the lower boud of the class iterval of the media, b is the legth of the class iterval of the media ad i is the umber of observatios i the i-th class iterval. I our surface area example, we would have the followig: the class i which the.5 threshold is reached, is the rd class, ie. the (5,6] class. This is goig to be the class iterval of the media (i.e., M = ). The lower boud of this class is 5, so c L = 5. I the classes before the third class, we have 11 + = 4 observatios, so we eed the 1 4 = 16-th observatio from the rd class (this is goig to be our media). Sice the legth of the class is b = 1, ad there are M = observatios overall i this class, the media ca be approximated as: Med 5 4 = 5 + 1 54.85. Please ote that this umber, agai, differs from the true value of 55.5, which meas that the approximate formula should be used oly i cases where raw data is ot available. I some cases, we may be iterested i describig the middle of the distributio with the most frequet observatio i the sample. This is ot always possible there are distributios which do ot have the property of a sigle most frequet value (for example, if the histogram has several peaks ). Therefore, the most frequet value called the mode is usually 5

oly calculated if the data come from a distributio with a stadard shape, i.e. whe the histogram has a sigle local maximum. The mode is the equal to this sigle maximum (the most frequet observatio i the sample). I the studet grades example, there are two equally frequet groups. We would ot defie a mode i this case. Please ote that for cotiuous variables, it is ot possible to calculate the sample mode uless we group the data because for cotiuous distributios, we will ot see two observatios which are equal to each other, ad thus each observatio has the same frequecy. But i such cases, it is possible to approximate the mode if we group the data first. If we are iterested i calculatig the mode, we should make sure that the itervals (at least i the middle of the distributio) have equal legths (otherwise, the results of the calculatios would be biased please ote that wider itervals will aturally have more observatios). Oce we have itervals of equal legth, we ca calculate the mode usig the followig formula: Mo = c L + b Mo Mo 1 ( Mo Mo 1 ) + ( Mo Mo+1 ), where, similarly to the formula for the media, we take c L, the lower boud of the class of the mode, ad add to it the appropriate fractio of the legth of the class of the mode (b). I this case, the appropriate fractio is calculated as the ratio of the differece betwee the couts of the class of the mode ( Mo ) ad the class adjacet to the left (with a cout of Mo 1 ), to the sum of the differeces betwee the cout of the class of the mode ad the classes adjacet to the left ad to the right (which has a cout equal to Mo+1 ). This meas that if the classes adjacet to the mode are equally less frequet tha the class of the mode, we should take the midpoit of the iterval as the approximatio of the mode. If the distributio is shifted to the left, i.e. smaller observatios are more frequet, the the mode should ot be i the middle of the iterval but more to the left; if the distributio is shifted to the right, i.e. larger observatios are more frequet, the the mode should be shifted to the right. I the case of our surface area example, all class itervals have equal legth (b = 1), so we ca calculate the mode. The class with the largest amout of observatios is the third class, so we have c L = 5, Mo = =, Mo 1 = =, Mo+1 = 4 = 1, ad Mo = 5 + 1 ( ) + ( 1) 5.. 1... Other measures of locatio. Additioal characteristics, which may be calculated i order to show where the values of the distributio (ot ecessarily the ceter of the distributio) are located, iclude quatiles other tha the media. For example, if we calculate the first ad the third quartiles (i.e., values such that they divide the sample ito subsamples coutig at least 1 ad observatios), we will kow i what rage the middle 5% 4 4 observatios are to be foud. I order to calculate the quartiles, we will use the same method as for calculatig the media; the oly differece is that we will be lookig for observatios which are raked ot (approximately) 1 out of, but (approximately) ad out of. 4 4 I particular, i cases where raw data are available, the quartiles will be calculated usig the geeral formula for quatiles of rak p, applied to p = 1 ad p =. This geeral formula 4 4 states that X p +1: if p Z Q p = 1 (X p: + X p+1: ) if p Z. For our apartmet surface area example, ad are iteger values, so we would take the 4 4 average of observatios umbered 5 ad 6 as Q 1, ad the average of observatios raked 75 ad 76 as Q, i.e. Q 1 = 47.65 + 48.1 47.88, ad Q = 6 66.78 + 67.8 67.9.

For grouped class iterval data, we will use the same mechaism of determiig the approximate value from the appropriate iterval that we used for the media, albeit with adjusted couts, i.e.: Q 1 = 4 cl + b M 1 i, M ad Q = 4 cl + b M 1 i, M where the values of M, c L, b ad i are defied aalogously as i the case of the media (but for the first ad third quartiles, respectively). For example, if we wated to calculate the first ad third quartiles of the distributio of apartmet surface areas, we would search for the observatios for which the cumulated frequecy reaches.5 ad.75, respectively; we would therefore have that the first quartile is located i the iterval (4, 5], while the third quartile is located i the iterval (6, 7], ad we would have: ad Q 1 = 4 + 1 5 11 46.9, Q 75 67 = 6 + 1 66.67. 1 The two values calculated o the base of grouped data are, agai, oly approximatios of the true values (calculated above). 1... Measures of variability. Oce we kow where the values of the variable uder study are located (more or less), we may wish to determie whether they are cocetrated aroud the ceter of the distributio, or dispersed. I order to do so, we will use measures of variability. These, too, ca be calculated based o momets of the empirical distributio, or o order statistics. We will start with the latter group. The most simple measure of the variability of a radom variable is the rage, i.e. the differece betwee the smallest ad the largest values observed i the data. I case of grouped class iterval data, we take the differece betwee the lower boud of the lowest iterval, ad the upper boud of the highest iterval. This measure, although simple, has may drawbacks; the most importat oe is that it is very susceptible to outliers (atypical observatios). Therefore, i may cases, istead of this rage, we look at the spread betwee the first ad the third quartiles: IQR = Q Q 1 i.e. the iterquartile rage (also called midspread or middle fifty). This measure is much more robust, as it covers the middle 5% observatios oly. Based o this measure which depeds o the scale of the variable uder study we ca calculate coefficiets of variatio: V Q = Q Med, V Q 1 Q = IQR Q + Q 1 (where Q = IQR/ is the quartile deviatio). These coefficiets allow us to compare dispersio of differet variables. Examples: I our studet grade example, calculatig the rage does ot tell us much it is equal to 5 = ad actually does ot deped o the distributio of the grades (i.e., o whether the subject is easy or hard to pass). I the surface area example, the rage, calculated for raw data is equal to 118.9.45 = 86.45, while for grouped class iterval data it is equal to 1 = 9, ad is obviously always biased upwards (the wider the itervals, the more so). 7

O the other had, if we calculate the iterquartile rage, it is equal to 66.67 46.9 =.58. This value is tellig, i that it shows that the middle 5% observatios are quite cocetrated, i.e. half of the the surface areas of apartmets are relatively close to the media value. Turig to the classical measures of dispersio, we will start with the variace, which, for raw data, is calculated as Ŝ = 1 i (X X) = 1 Xi ( X), for grouped data as Ŝ = 1 k i (X i X) = 1 k i Xi ( X), ad for grouped class iterval data is approximated as: Ŝ 1 k = i ( c i X) = 1 k i c i ( X). The last formula gives ubiased results if the variable is distributed uiformly i.e., we expect that the observatios classified i itervals are distributed symmetrically aroud the ceters of these itervals. If this assumptio is ot true as it is, for example, if the data come from a ormal distributio, where values further from the ceter of the distributio are less commo, ad we therefore expect values i itervals to be located more to the side the approximate formula for the variace is goig to systematically overestimate the variability i the data. I case of the ormal distributios, the followig formula for a correctio (the so called Sheppard s correctio) has bee proposed: S = Ŝ 1 k i (c i c i 1 ), 1 where c i s deote the bouds of the class itervals. If all the itervals are of equal legth, the value of the correctio reduces to c, where c is the legth of the iterval. 1 Now, if we calculated the variace for the surface area example based o raw data ad usig the value of the mea also calculated for raw data (i.e. 59.58), we would fid that the variace is equal to.85. If, o the other had, we were to use the approximate formula for grouped class iterval data, ad assumig that the mea was also calculated based o grouped data (ad equal to 58.7, approximately), we would have that the approximatio of the variace is equal to Ŝ 1 = 1 ( (5 58.7) 11 + (45 58.7) + (55 58.7) + (65 58.7) 1 +(75 58.7) 6 + (85 58.7) 8 (95 58.7) + (15 58.7) + (115 58.7) ) 1.1 Please ote that this approximatio is already smaller tha the true value of the variace, ad therefore subtractig the Sheppard s correctio (equal to 1 ) would just itroduce additioal error. This is because although the distributio of surface areas is ot uiform, it is t 1 ormal, either; also, the sample size may be too small (the errors resultig from small sample size may be larger tha the errors arisig from class groupig) to use the correctio. Please also ote that the variace is a measure expressed i squares of the uits of the variable uder study. If we wished to have a measure of variability expressed i the same uits, we would take the square root of the variace ad calculate the stadard deviatio: Ŝ = Ŝ, or S = S. I the surface area example, we would have Ŝ = 18.7 8

Now, if we were iterested i comparig the dispersio of differet variables for the same populatio, or the same variable for differet populatios, we would eed a measure of variability which would be ivariat to scalig (ad uits) of the variables. We ca costruct such a measure, called the coefficiet of variatio, by takig the ratio of the stadard deviatio ad the mea of the variable uder study: 1..4. Measures of asymmetry. V S = Ŝ X. 9