Statistics I Chapter 2: Univariate data analysis

Contents
- Graphical displays for categorical data (barchart, piechart)
- Graphical displays for numerical data (histogram, polygon, boxplot)
- Numerical measures to describe:
  - central tendency (mean, median, mode)
  - location (quartiles, percentiles)
  - variation (variance, standard deviation, quasi-variance and quasi-standard deviation, range, IQR, coefficient of variation)

Recommended reading
- Peña, D., Romo, J., Introducción a la Estadística para las Ciencias Sociales, Chapters 4, 5
- Newbold, P., Estadística para los Negocios y la Economía (2009), Chapter 2

Graphical presentation of data
Once we have a frequency distribution of the data, the following graphical displays can be obtained:
- Categorical data: piechart, barchart
- Numerical data: histogram, polygon, boxplot

Graphs for qualitative data: piechart
Example 1: The frequency table below corresponds to the blood types reported for a sample of 40 individuals.

Class   Absolute frequency   Relative frequency
A       12                   0.300
B       11                   0.275
AB       8                   0.200
O        9                   0.225
Total   40                   1
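As a quick check of the table in Example 1, the relative frequencies can be recomputed from the absolute ones. This is only a minimal sketch in plain Python; the variable names are illustrative and not part of the course material.

```python
# Absolute frequencies from the blood-type table in Example 1.
counts = {"A": 12, "B": 11, "AB": 8, "O": 9}

n = sum(counts.values())  # sample size

# Relative frequency = absolute frequency / sample size.
rel_freq = {blood_type: c / n for blood_type, c in counts.items()}

for blood_type, f in rel_freq.items():
    # The relative frequency is also the share of the pie taken by the slice.
    print(f"{blood_type:>2}: {f:.3f} ({f:.1%} of the pie)")
```

The relative frequencies add up to 1, which is why they can be read directly as slice sizes.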

Piechart
Example 1 cont.:
- Each slice is a fraction of the total size of the pie
- Many software packages order the slices alphabetically
- Piecharts are attractive, but harder to read than barcharts
- Avoid 3D piecharts: in those, the area in the background seems smaller than the area in the foreground

[Figure: piechart of the blood types with slices A 30%, B 27.5%, AB 20%, O 22.5%]

Graphs for qualitative data: barchart
Example 2: The frequency table below corresponds to levels of satisfaction for 901 employees.

Class   Absolute    Relative    Cumulative absolute   Cumulative relative
        frequency   frequency   frequency             frequency
VU       62         0.07         62                   0.07
U       108         0.12        170                   0.19
S       319         0.35        489                   0.54
VS      412         0.46        901                   1
Total   901         1
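The cumulative columns of Example 2 are running totals of the absolute and relative frequencies. The sketch below rebuilds them with `itertools.accumulate`; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Classes and absolute frequencies from Example 2 (901 employees).
classes = ["VU", "U", "S", "VS"]
abs_freq = [62, 108, 319, 412]

n = sum(abs_freq)

rel_freq = [c / n for c in abs_freq]      # relative frequencies
cum_abs = list(accumulate(abs_freq))      # running totals of the counts
cum_rel = [c / n for c in cum_abs]        # cumulative relative frequencies

for row in zip(classes, abs_freq, rel_freq, cum_abs, cum_rel):
    print("{:<3} {:>4} {:>5.2f} {:>4} {:>5.2f}".format(*row))
```

The last cumulative absolute frequency always equals the sample size, and the last cumulative relative frequency equals 1.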

Barchart
Example 2 cont.:
- Bars are of the same width and equally spaced, with heights corresponding to the frequencies
- There are gaps between the bars
- Bars are labeled with the class names
- Many software packages order the bars alphabetically

[Figure: barchart of FREQUENCY (axis from 0 to 400) for the classes VU, U, S, VS]

Barchart
Barcharts can also be constructed for discrete data if there are not too many distinct values.
This is a barchart for Example 3 of Ch. 1, where we looked at the number of leaves attacked by a pest for a sample of 50 plants.

[Figure: barchart of FREQUENCY (axis from 0 to 12) against the number of attacked leaves (0 to 10)]

Graphs for quantitative data: histogram and polygon
Example 4: The frequency distribution of the daily high temperature (in Fahrenheit) reported on 20 winter days is as follows:

Class interval   Midpoint   n_i   f_i    N_i   F_i
[10, 20)         15          3    0.15    3    0.15
[20, 30)         25          6    0.30    9    0.45
[30, 40)         35          5    0.25   14    0.70
[40, 50)         45          4    0.20   18    0.90
[50, 60)         55          2    0.10   20    1
Total                       20    1

Histogram and polygon
- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies

[Figure: histogram of the absolute FREQUENCIES (axis from 0 to 6) with the frequency polygon overlaid; horizontal axis: TEMP (F), from 0 to 70]

Histogram with area of 1 (on a density scale)
- Bin widths = widths of class intervals $l_i$ (not necessarily identical)
- Bin heights = $f_i / l_i$
- Bin areas = $l_i \cdot \frac{f_i}{l_i} = f_i$
- TOTAL AREA = 1

[Figure: density-scale histogram; vertical axis from 0.000 to 0.030; horizontal axis: TEMP (F), from 0 to 70]
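To see why the total area is 1, the bin heights can be computed for the temperature data of Example 4. This sketch assumes each height is the relative frequency divided by the class width ($f_i / l_i$); the variable names are illustrative.

```python
# Class intervals and absolute frequencies from Example 4 (20 winter days).
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
abs_freq = [3, 6, 5, 4, 2]

n = sum(abs_freq)
widths = [upper - lower for lower, upper in intervals]   # l_i
rel_freq = [c / n for c in abs_freq]                     # f_i

# Density-scale height of each bin: relative frequency per unit of width.
heights = [f / l for f, l in zip(rel_freq, widths)]

# Area of each bin is height * width, which recovers f_i; the areas sum to 1.
total_area = sum(h * l for h, l in zip(heights, widths))

print([round(h, 3) for h in heights])
print(round(total_area, 10))
```

With unequal class widths the same code applies unchanged, which is the main reason for using the density scale.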

Describing data numerically
- Center: mean, median, mode
- Location: quartiles, percentiles
- Variation: range, interquartile range, variance, standard deviation, coefficient of variation

New notation:
$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$
($\sum$: sum; $i = 1$: the lower limit; $n$: the upper limit; $x_i$: example of a formula depending on $i$)

Example: $\sum_{i=-1}^{3} i^2 = (-1)^2 + 0^2 + 1^2 + 2^2 + 3^2 = 15$
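The summation notation maps directly onto a loop over the index. As a small illustration (not part of the original slides), the example sum with lower limit -1 and upper limit 3 can be written as:

```python
# Sigma notation: lower limit -1, upper limit 3, summand i squared.
# range(-1, 4) stops before 4, so it covers i = -1, 0, 1, 2, 3.
total = sum(i**2 for i in range(-1, 4))

print(total)
```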

Central tendency: (arithmetic) mean
- The most common measure of central tendency
- Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \dots + x_N}{N}$
- Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$
- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $\bar{y} = a + b\bar{x}$
- Affected by extreme values (outliers)

Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 5, 4, 200
$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3$
$\bar{y} = \frac{3 + 1 + 5 + 4 + 200}{5} = 42.6$ (!)
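The two properties on this slide, sensitivity to outliers and the behaviour under a linear transformation, can be checked numerically. In this sketch the samples are the ones from the example, while the constants `a` and `b` and the name `w` are hypothetical values chosen only for illustration.

```python
from statistics import mean

# Samples from the example: Y equals X except for one extreme value.
x = [3, 1, 5, 4, 2]
y = [3, 1, 5, 4, 200]

x_bar = mean(x)
y_bar = mean(y)   # pulled far away from the bulk of the data by 200

# Linear transformation w = a + b*x with hypothetical constants a and b.
a, b = 10, 2
w = [a + b * xi for xi in x]
w_bar = mean(w)   # should equal a + b * x_bar

print(x_bar, y_bar, w_bar, a + b * x_bar)
```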

Central tendency: median
- In the ordered list, the median $M$ is the middle number:
  $M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd (the middle number)} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even (the average of the two middle numbers)} \end{cases}$
  ($x_{(1)}, x_{(2)}, \dots, x_{(n)}$ means that the observations are ranked in increasing order, e.g. $x_{(1)} = x_{\min}$, $x_{(n)} = x_{\max}$)
- Not affected by outliers

Example: Given observations 3, 1, 5, 4, 2 ($n = 5$), first rank the data: 1, 2, 3, 4, 5. Then identify the middle number:
$M = x_{((5+1)/2)} = x_{(3)} = 3$ (the 3rd smallest)

Example: Given observations 3, 1, 5, 4, 2, 0 ($n = 6$), first rank the data: 0, 1, 2, 3, 4, 5. Then identify the middle numbers:
$M = \frac{x_{(6/2)} + x_{(6/2+1)}}{2} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{2 + 3}{2} = 2.5$ (the average of the 3rd and 4th)
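The odd/even rule for the median translates into a few lines of code. The function below is a minimal sketch written for this purpose (the name `median` is our own; Python's `statistics.median` gives the same results).

```python
def median(values):
    """Median via the ranked list: middle value, or mean of the two middle values."""
    ranked = sorted(values)          # x_(1) <= x_(2) <= ... <= x_(n)
    n = len(ranked)
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2, shifted by one for 0-based indexing.
        return ranked[(n + 1) // 2 - 1]
    # Even n: average of the values at 1-based positions n/2 and n/2 + 1.
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2


print(median([3, 1, 5, 4, 2]))       # odd number of observations
print(median([3, 1, 5, 4, 2, 0]))    # even number of observations
print(median([3, 1, 5, 4, 200]))     # an extreme value barely moves the median
```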

Central tendency: mode
- The value that occurs most often
- Not affected by outliers
- Used for either numerical or categorical data
- There may be no mode, or there may be several modes

Example: Given observations 3, 1, 5, 4, 2, there is no mode
Example: Given observations 3, 1, 5, 4, 2, 1, the mode is 1
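A short sketch of how the mode could be computed, following the convention of the examples that a sample with no repeated value has no mode. The function name `modes` and the categorical sample at the end are hypothetical, added only for illustration.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; empty if no value is repeated."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Every value occurs once: by the convention used here, no mode.
        return []
    return [value for value, c in counts.items() if c == top]


print(modes([3, 1, 5, 4, 2]))             # no mode
print(modes([3, 1, 5, 4, 2, 1]))          # single mode
print(modes(["A", "B", "A", "O", "B"]))   # hypothetical categorical data, two modes
```

Note that `statistics.multimode` from the standard library would instead return every value when all of them occur once, which is a different convention.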

Shape: comparing mean and median
Three types of distributions:
- Skewed to the left: Mean < Median ($\bar{x} < M$)
- Symmetric: Mean = Median ($\bar{x} = M$)
- Skewed to the right: Median < Mean ($M < \bar{x}$)

[Figure: three density curves labeled LEFT SKEWED, SYMMETRIC and RIGHT SKEWED]

Note: The distribution in the middle is known as bell-shaped or normal

Quartiles and percentiles
- Quartiles split the ranked data into four segments with an equal number of values per segment
- The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$
- The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$
- The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$

Example: Given observations 22, 18, 17, 16, 16, 13, 12, 21, 11 ($n = 9$), first rank the data: 11, 12, 13, 16, 16, 17, 18, 21, 22. Then identify the positions (here the fractional positions 2.5 and 7.5 are rounded up to 3 and 8):
$Q_1 = x_{(2.5)} = x_{(3)} = 13$
$Q_2 = x_{(5)} = 16$
$Q_3 = x_{(7.5)} = x_{(8)} = 21$

- The $k$th percentile, $k = 1, 2, \dots, 99$, is $P_k = x_{(k(n+1)/100)}$

Example cont.: 60th percentile $= x_{(60(9+1)/100)} = x_{(6)} = 17$
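The position rule $k(n+1)/100$ can be sketched in code. One assumption is needed for fractional positions: the example on this slide treats position 2.5 as position 3, so the sketch rounds a fractional position to the nearest integer with halves going up. Other textbooks interpolate between neighbouring values instead and would give slightly different quartiles. The function names are our own.

```python
import math


def order_statistic(values, position):
    """Return x_(position) for a 1-based position in the ranked data.

    Assumption: a fractional position is rounded to the nearest integer,
    with .5 rounded up (mirroring x_(2.5) -> x_(3) in the example).
    """
    ranked = sorted(values)
    index = math.floor(position + 0.5)
    index = min(max(index, 1), len(ranked))   # keep the position inside 1..n
    return ranked[index - 1]


def percentile(values, k):
    """k-th percentile using the position k(n+1)/100."""
    n = len(values)
    return order_statistic(values, k * (n + 1) / 100)


data = [22, 18, 17, 16, 16, 13, 12, 21, 11]

q1, q2, q3 = (percentile(data, k) for k in (25, 50, 75))
p60 = percentile(data, 60)

print(q1, q2, q3, p60)
```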

Variation: range and interquartile range (IQR)
- Range is the simplest measure of variation: $R = x_{\max} - x_{\min}$
- Ignores the way the data is distributed
- Sensitive to outliers

Example: Given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
Example: Given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$

- Interquartile range (IQR) can eliminate some outlier problems: eliminate the high and low observations and calculate the range of the middle 50% of the data
  $\mathrm{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$
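The sensitivity of the range to a single extreme value is easy to reproduce. This is a minimal sketch with an illustrative function name:

```python
def value_range(values):
    """Range R = largest observation minus smallest observation."""
    return max(values) - min(values)


print(value_range([3, 1, 5, 4, 2]))     # without the outlier
print(value_range([3, 1, 5, 4, 100]))   # one extreme value dominates the range
```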

Variation: interquartile range and boxplot
Outliers are observations that fall
- below the value of $Q_1 - 1.5\,\mathrm{IQR}$, or
- above the value of $Q_3 + 1.5\,\mathrm{IQR}$
For extreme outliers, replace 1.5 by 3 in the above definition

[Figure: boxplot with $x_{\min} = 12$, $Q_1 = 24$, MEDIAN ($Q_2$) $= 31$, $Q_3 = 42$, $x_{\max} = 58$; each of the four segments contains 25% of the data; IQR = 18]
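The outlier rule amounts to computing two cut-off values from the quartiles. The sketch below applies it to the five-number summary shown in the boxplot figure; the function name `fences` and its parameter `k` are illustrative assumptions.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper cut-offs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Five-number summary read from the boxplot figure.
x_min, q1, q2, q3, x_max = 12, 24, 31, 42, 58

lower, upper = fences(q1, q3)                    # outliers (k = 1.5)
ext_lower, ext_upper = fences(q1, q3, k=3)       # extreme outliers (k = 3)

# An observation is an outlier if it falls outside [lower, upper].
has_outliers = x_min < lower or x_max > upper

print(q3 - q1, (lower, upper), (ext_lower, ext_upper), has_outliers)
```

Since the smallest and largest observations both lie inside the cut-offs, this data set has no outliers under the rule.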

Measure of variation: variance
- Average of squared deviations of values from the mean
- Population variance:
  $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- Sample variance (divided by $n$):
  $\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n}$ (the second form is faster to calculate)
- Sample quasi-variance, or corrected sample variance (divided by $n - 1$):
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n - 1}$
- They are related via $\hat{\sigma}^2 = \frac{n - 1}{n} s^2$
- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$

Measure of variation: standard deviation (SD)
- The most commonly used measure of spread
- Population standard deviation, sample standard deviation and sample quasi-standard deviation are, respectively,
  $\sigma = \sqrt{\sigma^2}$, $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$, $s = \sqrt{s^2}$
- Shows variation about the mean
- Has the same units as the original data, whilst the variance is in squared units
- Variance and SD are both affected by outliers

Calculating variance and standard deviation
Example:
X: 11, 12, 13, 16, 16, 17, 18, 21
Y: 14, 15, 15, 15, 16, 16, 16, 17
Z: 11, 11, 11, 12, 19, 20, 20, 20

$\bar{x} = \frac{124}{8} = 15.5$, $\bar{y} = \frac{124}{8} = 15.5$, $\bar{z} = \frac{124}{8} = 15.5$

$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$
$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$
$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$

$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429$, so $s_x = 3.3381$
$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571$, so $s_y = 0.9258$
$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571$, so $s_z = 4.5670$
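The hand calculation for X, Y and Z can be reproduced with the shortcut formula (sum of squares minus $n$ times the squared mean, divided by $n - 1$). This is a sketch with function and variable names of our own choosing; `statistics.variance` from the standard library also divides by $n - 1$ and can serve as a cross-check.

```python
from math import sqrt


def quasi_variance(values):
    """Sample quasi-variance via the shortcut formula, dividing by n - 1."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum(v * v for v in values)
    return (sum_of_squares - n * mean**2) / (n - 1)


samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

for name, values in samples.items():
    n = len(values)
    s2 = quasi_variance(values)          # divided by n - 1
    sigma_hat2 = (n - 1) / n * s2        # sample variance, divided by n
    print(f"{name}: s^2 = {s2:.4f}, s = {sqrt(s2):.4f}, sigma_hat^2 = {sigma_hat2:.4f}")
```

All three samples share the same mean, so the differences in the output come only from how the values spread around it.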

Comparing standard deviations
Example cont.:
X: 11, 12, 13, 16, 16, 17, 18, 21 ($\bar{x} = 15.5$, $s_x = 3.3$)
Y: 14, 15, 15, 15, 16, 16, 16, 17 ($\bar{y} = 15.5$, $s_y = 0.9$)
Z: 11, 11, 11, 12, 19, 20, 20, 20 ($\bar{z} = 15.5$, $s_z = 4.6$)

[Figure: three dot plots of X, Y and Z on a common axis from 11 to 21]

Numerical summaries and frequency tables. Standardization.
- If the data is discrete, with $k$ distinct values $x_i$ and absolute frequencies $n_i$, then
  $\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n}$ and $s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$
- If the data is continuous, we replace $x_i$ in the above definition by the midpoints of the class intervals
- To standardize a variable $x$ means to calculate $\frac{x - \bar{x}}{s}$
- If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed ones $z_1, \dots, z_n$, then the mean of the $z$'s is zero and their standard deviation is one
- Standardization = finding the z-score
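Both ideas on this slide can be illustrated with data used earlier in the chapter. The first part of the sketch applies the frequency-table formulas to the temperature classes of Example 4, with midpoints computed from the class limits; the second part standardizes sample X from the variance example. Variable names are illustrative.

```python
from statistics import mean, stdev

# Part 1: mean and quasi-variance from a frequency table (Example 4).
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
abs_freq = [3, 6, 5, 4, 2]

midpoints = [(lower + upper) / 2 for lower, upper in intervals]
n = sum(abs_freq)

grouped_mean = sum(m * c for m, c in zip(midpoints, abs_freq)) / n
grouped_s2 = (
    sum(m * m * c for m, c in zip(midpoints, abs_freq)) - n * grouped_mean**2
) / (n - 1)

print(grouped_mean, round(grouped_s2, 4))

# Part 2: standardization (z-scores) of sample X.
x = [11, 12, 13, 16, 16, 17, 18, 21]
x_bar, s = mean(x), stdev(x)        # stdev divides by n - 1, like s
z = [(xi - x_bar) / s for xi in x]

# Up to floating-point error, the z-scores have mean 0 and standard deviation 1.
print(round(mean(z), 10), round(stdev(z), 10))
```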

Empirical rule
If the data is bell-shaped (normal), that is, symmetric and with light tails, the following rule holds approximately:
- 68% of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
- 95% of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
- 99.7% of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$
Note: This rule is also known as the 68-95-99.7 rule

Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data is bell-shaped, give the limits of an interval that captures 95% of the observations.
95% of the $x_i$'s are in: $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
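The example reduces to computing mean plus or minus a multiple of the standard deviation. A minimal sketch follows, using the numbers of the example; the function name and the `COVERAGE` table are our own illustrative choices, and the rule only applies under the bell-shape assumption.

```python
def empirical_interval(mean, sd, k):
    """Interval (mean - k*sd, mean + k*sd) used by the empirical rule."""
    return (mean - k * sd, mean + k * sd)


# Approximate share of bell-shaped data within k standard deviations of the mean.
COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}

n, x_bar, s = 100, 40, 5   # values given in the example

for k, share in COVERAGE.items():
    low, high = empirical_interval(x_bar, s, k)
    print(f"about {share:.1%} of the {n} observations lie in ({low}, {high})")
```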

Measure of variation: coefficient of variation (CV)
- Measures relative variation and is defined as $CV = \frac{s}{\bar{x}}$
- Is a unitless number (sometimes given as a percentage)
- Shows variation relative to the mean

Example:
Stock A: Average price last year = 50, Standard deviation = 5
Stock B: Average price last year = 100, Standard deviation = 5
$CV_A = \frac{5}{50} = 0.10$, $CV_B = \frac{5}{100} = 0.05$
Both stocks have the same SD, but stock B is less variable relative to its mean price
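The stock comparison can be reproduced in a couple of lines. This sketch assumes the CV is the standard deviation divided by the mean, as in the example; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Relative variation: standard deviation divided by the mean."""
    return sd / mean


cv_a = coefficient_of_variation(5, 50)    # stock A
cv_b = coefficient_of_variation(5, 100)   # stock B

# Same standard deviation, but B varies less relative to its mean price.
print(f"CV_A = {cv_a:.2f} ({cv_a:.0%}), CV_B = {cv_b:.2f} ({cv_b:.0%})")
```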