Measures of center

The mean

The mean of a distribution is the arithmetic average of the observations:

\[ \bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The median

The median is the midpoint of a distribution: the number $M$ such that half the observations are smaller and half are larger.

How to find the median

Suppose the observations are $x_1, x_2, \ldots, x_n$.

1. Arrange the data in increasing order and let $x_{(i)}$ denote the $i$th smallest observation.
2. If the number of observations $n$ is odd, the median is the center observation in the ordered list:
   \[ M = x_{((n+1)/2)} \]
3. If the number of observations $n$ is even, the median is the average of the two center observations in the ordered list:
   \[ M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} \]

Numerical Description of Data, Jan 7, 2004

Measures of center

Examples:

Data set 1:

  x_1   x_2   x_3   x_4   x_5   x_6   x_7   x_8   x_9
    2     4     3     4     6     5     4    -6     5

Arranged in increasing order:

  x_(1) x_(2) x_(3) x_(4) x_(5) x_(6) x_(7) x_(8) x_(9)
   -6     2     3     4     4     4     5     5     6

There is an odd number of observations, so the median is

\[ M = x_{((n+1)/2)} = x_{(5)} = 4. \]

The mean is given by

\[ \bar{x} = \frac{2 + 4 + 3 + 4 + 6 + 5 + 4 + (-6) + 5}{9} = \frac{27}{9} = 3. \]

Data set 2:

  x_1   x_2   x_3   x_4   x_5   x_6   x_7   x_8   x_9   x_10
  2.3   8.8   3.9   4.1   6.4   5.9   4.2   2.9   1.3   5.1

Arranged in increasing order:

  x_(1) x_(2) x_(3) x_(4) x_(5) x_(6) x_(7) x_(8) x_(9) x_(10)
  1.3   2.3   2.9   3.9   4.1   4.2   5.1   5.9   6.4   8.8

There is an even number of observations, so the median is

\[ M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{4.1 + 4.2}{2} = 4.15. \]

The mean is given by

\[ \bar{x} = \frac{2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1}{10} = \frac{44.9}{10} = 4.49. \]
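The mean and the median rule can be sketched in a few lines of code. The slides do the computation by hand; the Python below and its function names (`mean`, `median`) are illustrative assumptions, not part of the course material. The index arithmetic translates the 1-based positions $x_{(i)}$ into Python's 0-based list indices.

```python
def mean(xs):
    """Arithmetic average of the observations."""
    return sum(xs) / len(xs)


def median(xs):
    """Median via the ordered list x_(1) <= ... <= x_(n)."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: M = x_((n+1)/2), which is 0-based index n // 2.
        return ordered[n // 2]
    # Even n: average of x_(n/2) and x_(n/2+1),
    # which are 0-based indices n/2 - 1 and n/2.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]

print("data set 1:", mean(data1), median(data1))
print("data set 2:", round(mean(data2), 2), round(median(data2), 2))
```

The results for data set 2 are rounded before printing because decimal fractions such as 4.1 are not represented exactly in floating point.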

Mean versus median

- The mean is easy to work with algebraically, while the median is not.
- The mean is sensitive to extreme observations, while the median is more robust.

Example: the observations 0, 1, 2, with the largest observation then moved to 10.

[Figure: the observations marked on a number line from 0 to 10]

The original mean and median are

\[ \bar{x} = \frac{0 + 1 + 2}{3} = 1 \quad\text{and}\quad M = x_{((n+1)/2)} = 1. \]

The modified mean and median are

\[ \bar{x} = \frac{0 + 1 + 10}{3} = 3\tfrac{2}{3} \quad\text{and}\quad M = x_{((n+1)/2)} = 1. \]

- If the distribution is exactly symmetric, then mean = median.
- In a skewed distribution, the mean is further out in the longer tail than the median.
- The median is preferable for strongly skewed distributions, or when outliers are present.
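The sensitivity of the mean can be checked numerically. This is a minimal sketch using Python's standard `statistics` module (an assumption of this example; the slides do not use Python). The observation values 0, 1, 2 and 0, 1, 10 are read off the sums in the example.

```python
from statistics import mean, median

original = [0, 1, 2]
modified = [0, 1, 10]  # largest observation moved from 2 to 10

for label, data in (("original", original), ("modified", modified)):
    # The mean follows the extreme value; the median does not move.
    print(label, "mean =", mean(data), "median =", median(data))
```

Only the mean changes (from 1 to 11/3); the median stays at 1 in both cases.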

Measures of spread

Example: Monthly returns on two stocks

[Figure: histograms for Stock A and Stock B; vertical axis "Frequency" (0 to 40), horizontal axis labelled "Daily returns (in %)" (-10 to 20)]

           Stock A   Stock B
  Mean       4.95      4.82
  Median     4.99      4.68

The distributions of the two stocks have approximately the same mean and median, but stock B is more volatile and thus more risky.

Measures of center alone are an insufficient description of a distribution and can be misleading.

The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread.

Common measures of spread are

- the quartiles and the interquartile range
- the standard deviation

Quartiles

Quartiles divide the data into four equal parts.

- Lower (or first) quartile $Q_L$: median of all observations to the left of the median $M$ in the ordered list
- Middle (or second) quartile $M = Q_M$: median of all observations
- Upper (or third) quartile $Q_U$: median of all observations to the right of the median $M$ in the ordered list

Interquartile range: $\mathrm{IQR} = Q_U - Q_L$, the distance between the upper and lower quartiles.

How to find the quartiles

1. Arrange the data in increasing order and find the median $M$.
2. Find the median of the observations to the left of $M$; this is the lower quartile $Q_L$.
3. Find the median of the observations to the right of $M$; this is the upper quartile $Q_U$.

Example:

Data set:

  x_1   x_2   x_3   x_4   x_5   x_6   x_7   x_8   x_9
    2     4     3     4     6     5     4    -6     5

Arranged in increasing order:

  x_(1) x_(2) x_(3) x_(4) x_(5) x_(6) x_(7) x_(8) x_(9)
   -6     2     3     4     4     4     5     5     6

$Q_L$ is the median of $\{-6, 2, 3, 4\}$: $Q_L = 2.5$
$Q_U$ is the median of $\{4, 5, 5, 6\}$: $Q_U = 5$
$\mathrm{IQR} = 5 - 2.5 = 2.5$
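The quartile steps can be sketched as follows. The function names are hypothetical, and Python is used for illustration only. For odd $n$ the middle observation is left out of both halves, as in the worked example. For even $n$ the sketch assumes the ordered list is simply split into its first and last $n/2$ observations, a case the slides do not show.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(xs):
    """Return (Q_L, M, Q_U) using the median-of-halves rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # observations left of the median position
    upper = ordered[n - half:]    # observations right of the median position
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
q_l, m, q_u = quartiles(data)
iqr = q_u - q_l
print("Q_L =", q_l, "M =", m, "Q_U =", q_u, "IQR =", iqr)
```

Statistical packages use several slightly different quartile conventions, so library results may differ from this rule on small data sets.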

Percentiles

More generally, we might be interested in the value that is exceeded by only a certain percentage of the observations:

The $p$th percentile of a set of observations is the value such that $p\%$ of the observations are less than or equal to it and $(100 - p)\%$ of the observations are greater than or equal to it.

How to find the percentiles

1. Arrange the data in increasing order.
2. If $np/100$ is not an integer, then $x_{(k+1)}$ is the $p$th percentile, where $k$ is the largest integer less than $np/100$.
3. If $np/100$ is an integer, the $p$th percentile is the average of $x_{(np/100)}$ and $x_{(np/100+1)}$.

Five-number summary

A numerical summary of a distribution $\{x_1, \ldots, x_n\}$ is given by

\[ x_{(1)} \quad Q_L \quad M \quad Q_U \quad x_{(n)} \]

A simple boxplot is a graph of the five-number summary.
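The percentile rule translates directly into code. This is a sketch in Python with a hypothetical function name; it assumes $0 < p < 100$, since the rule refers to positions outside the ordered list at the endpoints.

```python
def percentile(xs, p):
    """p-th percentile by the np/100 rule, for 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    ordered = sorted(xs)
    n = len(ordered)
    # k = integer part of np/100, rest = what is left over.
    k, rest = divmod(n * p, 100)
    k = int(k)
    if rest != 0:
        # np/100 is not an integer: take x_(k+1), i.e. 0-based index k.
        return ordered[k]
    # np/100 is an integer: average x_(k) and x_(k+1).
    return (ordered[k - 1] + ordered[k]) / 2


data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]

print(percentile(data1, 50))            # n = 9: np/100 = 4.5, so x_(5)
print(round(percentile(data2, 50), 2))  # n = 10: average of x_(5) and x_(6)
print(percentile(data1, 25))            # n = 9: np/100 = 2.25, so x_(3)
```

For $p = 50$ the rule reproduces the median. For $p = 25$ on the nine-observation data set it returns $x_{(3)} = 3$, which differs from $Q_L = 2.5$ obtained with the median-of-halves rule, so the two conventions need not agree on small samples.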

Boxplots

A common rule for discovering outliers is the $1.5 \times \mathrm{IQR}$ rule: an observation is a suspected outlier if it falls more than $1.5 \times \mathrm{IQR}$ below $Q_L$ or above $Q_U$.

How to draw a boxplot (box-and-whisker plot)

1. A box is drawn from the lower to the upper quartile ($Q_L$ and $Q_U$).
2. The median of the data is shown by a line in the box.
3. Lines (the whiskers) are drawn from the ends of the box to the most extreme observations within a distance of $1.5 \times \mathrm{IQR}$ (interquartile range) of the box.
4. Measurements falling more than $1.5 \times \mathrm{IQR}$ beyond the ends of the box are potential outliers and are marked by individual symbols.

[Figure: side-by-side boxplots of Stock A and Stock B; vertical axis from -10 to 20]

Plotting a boxplot with STATA:

    . infile A B using stocks.txt, clear
    . label var A "Stock A"
    . label var B "Stock B"
    . graph box A B, xsize(2) ysize(5)
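The numbers behind a boxplot can be computed without drawing it. The sketch below is in Python, not the Stata used for the plot. The function and key names are hypothetical, and the quartiles follow the median-of-halves rule (with the even-$n$ split assumed to be the first and last $n/2$ observations). It returns the five-number summary, the whisker ends and the suspected outliers, and needs at least two observations.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def boxplot_stats(xs):
    """Five-number summary, whisker ends and suspected outliers."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    q_l = median_sorted(ordered[:half])
    q_u = median_sorted(ordered[n - half:])
    iqr = q_u - q_l
    low_fence = q_l - 1.5 * iqr
    high_fence = q_u + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "five_number": (ordered[0], q_l, median_sorted(ordered), q_u, ordered[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
stats = boxplot_stats(data1)
for name, value in stats.items():
    print(name, value)
```

For this data set the fences lie at $2.5 - 3.75 = -1.25$ and $5 + 3.75 = 8.75$, so the whiskers run from 2 to 6 and the observation $-6$ is flagged as a suspected outlier.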

Boxplots

Interpretation of boxplots

- The IQR is a measure of the sample's variability.
- If the whiskers differ in length, the distribution of the data is probably skewed in the direction of the longer whisker.
- Very extreme observations (more than $3 \times \mathrm{IQR}$ below the lower quartile or above the upper quartile) are outliers, with one of the following explanations:
  a) The measurement is incorrect (error in the measurement process or in data processing).
  b) The measurement belongs to a different population.
  c) The measurement is correct, but represents a rare (chance) event.
  We accept the last explanation only after carefully ruling out all others.
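The two thresholds ($1.5 \times \mathrm{IQR}$ for suspected outliers, $3 \times \mathrm{IQR}$ for very extreme observations) can be combined into one small classifier. This is an illustrative Python sketch: the function name and the labels "inside", "suspected" and "extreme" are assumptions of this example, and the quartiles are passed in and not computed.

```python
def classify(x, q_l, q_u):
    """Label an observation by its distance outside the box [Q_L, Q_U]."""
    iqr = q_u - q_l
    # Distance below Q_L or above Q_U; zero for points inside the box.
    distance = max(q_l - x, x - q_u, 0)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "suspected"
    return "inside"


# Quartiles of the nine-observation example: Q_L = 2.5, Q_U = 5.
for x in (-6, -2, 0, 6):
    print(x, classify(x, 2.5, 5))
```

With these quartiles the observation $-6$ lies 8.5 below $Q_L$, more than $3 \times 2.5 = 7.5$, so it counts as very extreme and not just suspected.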

Variance and standard deviation

Suppose there are $n$ observations $x_1, x_2, \ldots, x_n$.

The variance of the $n$ observations is:

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

This is (approximately) the average of the squared distances of the observations from the mean.

The standard deviation is:

\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Why $n - 1$?

Division by $n - 1$ instead of $n$ in the variance calculation is a common cause of confusion. Note that

\[ \sum_{i=1}^{n} (x_i - \bar{x}) = 0. \]

Thus, if you know any $n - 1$ of the differences, the last difference can be determined from the others. The number of freely varying observations, $n - 1$ in this case, is called the degrees of freedom.
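The definitions and the zero-sum property of the deviations can be checked on the nine-observation data set from the earlier examples. This is a minimal Python sketch; the function names are illustrative and not taken from the slides.

```python
import math


def variance(xs):
    """Sample variance with divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def std_dev(xs):
    """Sample standard deviation: square root of the variance."""
    return math.sqrt(variance(xs))


data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
xbar = sum(data) / len(data)
deviations = [x - xbar for x in data]

# The deviations from the mean always sum to zero (up to rounding),
# so only n - 1 of them can vary freely.
print("sum of deviations:", sum(deviations))
print("last deviation recovered:", -sum(deviations[:-1]), "vs", deviations[-1])
print("s^2 =", variance(data), " s =", round(std_dev(data), 4))
```

Here the squared deviations add up to 102, so $s^2 = 102 / 8 = 12.75$ and $s \approx 3.57$.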

Properties of s

- $s$ measures spread around the mean $\Rightarrow$ use it only if the mean is used as the measure of center.
- $s = 0 \Leftrightarrow$ all observations are the same.
- $s$ is in the same units as the measurements, while $s^2$ is in the square of these units.
- $s$, like $\bar{x}$, is not resistant to outliers.

Five-number summary versus standard deviation

- The five-number summary is better for describing skewed distributions, since each side has a different spread.
- $\bar{x}$ and $s$ are preferred for symmetric distributions with no outliers.
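These properties are easy to see numerically. The sketch below uses `stdev` from Python's standard `statistics` module (an assumption of this example; it divides by $n - 1$, matching the definition of $s$). The rescaling by 100 stands in for a hypothetical change of units.

```python
from statistics import stdev

data = [2, 4, 3, 4, 6, 5, 4, -6, 5]

# Not resistant: dropping the single observation -6 shrinks s sharply.
without_outlier = [x for x in data if x != -6]
print("s with -6:   ", round(stdev(data), 3))
print("s without -6:", round(stdev(without_outlier), 3))

# s = 0 when all observations are the same.
constant = [7, 7, 7, 7]
print("s of constant data:", stdev(constant))

# Same units as the data: rescaling the measurements by a factor
# of 100 rescales s by the same factor.
scaled = [100 * x for x in data]
print("ratio after rescaling:", round(stdev(scaled) / stdev(data), 6))
```

On this data set the single observation $-6$ accounts for most of the spread: removing it reduces $s$ from about 3.57 to about 1.25.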