The Economist, September 6, 214 1 Histograms, Central Tendency, and Variability Lecture 2 Reading: Sections 5 5.6 Includes ALL margin notes and boxes: For Example, Guided Example, Notation Alert, Just Checking, Optional Math Boxes, What Can Go Wrong? and Ethics in Action 2 Histogram graphically describes how a single variable containing interval data is distributed Range of data divided into non-overlapping and equal width classes (bins) that cover range of values Histogram Fraction n = 174 countries 2 4 6 Inflation Rate, 211 How many bins? Width of bins? http://data.worldbank.org/indicator/fp.cpi.totl.zg 3 Lecture 2, Page 1 of 11
8 n = 174 countries n = 174 countries Frequency 6 4 2 Fraction 2 4 6 Inflation Rate, 211 2 4 6 Inflation Rate, 211.8.6 n = 174 countries 2 4 6 Inflation Rate, 211 Frequency histogram: Bar height number of observations in bin Relative frequency histogram: Bar height fraction of obs. in bin histogram: Bar area measures the fraction of observations in bin 4 2 4 6 Inflation Rate, 211 5 2 4 6 Inflation Rate, 211.5 2 4 6 Inflation Rate, 211 1.5 1.5 Number of bins changes the appearance of the histogram Sturges formula: # of bins = 1 + 3*log(n) [Note log base 1.] 2 4 6 Inflation Rate, 211 OECD inflation: 1 + 3*log(34) = 6.5 6 [but STATA picked 5] 6 Lecture 2, Page 2 of 11
Shape of Things Histogram gives overview of a variable with a single picture Can make informal inferences about the shape of population Symmetric: If draw an imaginary line at center, have mirror image on each side Positively skewed: long tail to right (aka skewed to the right) Negatively skewed: long tail to left (aka skewed to the left) Modality: # major peaks Bell/Normal/Gaussian 7 Four Perfectly Symmetric Histograms.5 1 2 3 4 1 2 3 4 5 2 3 4 5 6.5 5 5 2 4 6 8 8 Four Positively Skewed Histograms.5 5 5 1 15-6 -4-2 2 4.6.8-11 -1-9 -8-7 -6 5 1 15 2 25 9 Lecture 2, Page 3 of 11
Four Negatively Skewed Histograms.5 5 5 1 15 2-15 -1-5.6.8-15 -14-13 -12-11 -1 25 3 35 4 45 5 1.5 5 5.5 5 5 Four Bimodal Histograms 2 4 6 8 2 4 6 8 1.6.8.5 5 1 2 3 5 1 15 11 Two Perfectly Bell Shaped Histograms -4-2 2 4 1 15 2 25 3 Are these symmetric? Skewed? Unimodal? Take a good look because this is one of the last times you will see perfect bell shaped (normal) histograms 12 Lecture 2, Page 4 of 11
.5 n = 185 countries; Source CIA Fraction 2 4 6 8 1 GDP per capita (PPP), 212 est. https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_24.txt 13 Samples vs. Populations Sample is a random subset of population Sampling noise: Chance differences between population and a random sample Driven by the sample size, not sample size relative to the population size, which is assumed infinite (pp. 3 31, The Sample Size is What Matters ) Informal inference: consider sample size (n) Never see the perfect forms (Plato): statements about shape always approximate Nearly Normal Condition 14 5 5 Population, N = 1,,.5 Sample 1; n = 1.5 5 1 15 8 9 1 11.5 Sample 2; n = 1 How many samples would you have in real life? 5.5 8 9 1 11 12 Why aren t these samples perfectly Bell shaped? Sample 3; n = 1 6 8 1 12 15 Lecture 2, Page 5 of 11
5 5.5 Population, N = 1,, 5 1 15 5 5.5 Sample 1; n = 3 Why are there more bins than last slide? 6 8 1 12 14 Sample 2; n = 3 Sample 3; n = 3 7 8 9 1 11 12 6 7 8 9 1 11 16 5 5.5 Population, N = 1,, 5 1 15 Sample 1; n = 1 6 8 1 12 14 Sample 2; n = 1 Sample 3; n = 1 6 8 1 12 14 5 1 15 17 What to Conclude About Shape? 5 5.5 n: 3-4 -2 2 4 X n: 5.8.6 1 11 12 Y 13 14 Is the graph on the left symmetric? Bell shaped? Is the graph on the right symmetric? Bell shaped? Bi-modal? 18 Lecture 2, Page 6 of 11
Hsieh and Olken (214) JEP The Missing Missing Middle Summer 214 http://pubs.aeaweb.org/doi/pdfplus/1257/jep8.89 19 Appendix, http://www.aeaweb.org/jep/app/283/28389_app.pdf There is a clear bimodality in the distribution of valueadded/capital for the large firms. However, the capital questionnaire for large firms was ambiguous as to whether the results were to be entered in thousands of Rupiah or Rupiah. Our best guess is that approximately half the firms used thousands and half used millions. 2 Summary Statistics Statistics (i.e. summary statistics) give a concise idea of what data look like For a single variable, statistics can give numeric measures of: Central tendency: mean and median Variability: range, variance, standard deviation, coefficient of variation, R Relative standing: percentiles For two variables, also measure relationship 21 Lecture 2, Page 7 of 11
Mean and Median Population mean, a parameter: μ = N Sample mean, a statistic: X = n n Which is subject to sampling error? i=1 x i N i=1 x i Median is the middle obs. after sorting if even # of obs., average 2 middle ones Fraction.5 mean = 3, median = 3 2 4 6 Inflation Rate, 211 22 5 5.5 Normal Distribution Population mu=1., med=1. 5 1 15 Is X (a statistic) equal to µ (a parameter)? Sample, n=49 X-bar=13.9, med=13. 6 8 1 12 14 Why is the population mean equal to the population median? Why isn t the sample mean equal to the sample median? 23 A Symmetric Distribution (Uniform).8.6 Population mu=5., med=5. 2 4 6 8 1 Book Rating Sample, n=41 X-bar=55, med=62 5.5 2 4 6 8 1 Book Rating Why is the population mean equal to the population median? Why is the sample median different from the population median? 24 Lecture 2, Page 8 of 11
Fraction n = 174 countries mean = 6.6, median = 5. Why is the mean greater than the median? 2 4 6 Inflation Rate, 211 25 Measures of Variability (Spread) Range: max min Variance: N i=1 σ 2 = x i μ 2 N s 2 = n x 2 i=1 i X n 1 Standard deviation: s = variance Coefficient of variation (textbook) min = -, max = 6.5 var = 1.6, sd = 1 2 4 6 Inflation Rate, 211 26 Breaking Down Variance Numerator: total sum of squares (TSS) If all sampled countries have 3% inflation (x i = 3 for all i), what would TSS & s 2 be? Denominator: ν ( nu ) Only n 1 free obs left after calculate mean Units of variance? How about s.d.? s 2 = n x 2 i=1 i X n 1 n TSS = x i X 2 i=1 degrees of freedom: ν = n 1 27 Lecture 2, Page 9 of 11
Empirical Rule (Normal/Bell) If a random sample is drawn from a Normal population then about: 68% of observations will lie within 1 s.d. of the mean (i.e. between X s and X + s) 95% of observations will lie within 2 s.d. of the mean (i.e. between X 2s and X + 2s) 99.7% of observations will lie within 3 s.d. of the mean (i.e. between X 3s and X + 3s) Empirical Rule only applies if normal 28 n = 246, mean = 99, sd = 14.5 6 8 1 12 14 29.5 Histogram #1, n = 94 8 9 1 11 12 X.8.6 Histogram #2, n = 154 8 9 1 11 12 X Histogram #3, n = 298 5 Histogram #4, n = 521 5 5.e-4.5 5 1 15 X 9 95 1 15 11 X Approximately, what is the s.d. of the variable in each histogram? 3 Lecture 2, Page 1 of 11
Chebysheff s Theorem At least 1*(1 1/k 2 )% of observations lie within k s.d. s of the mean for k>1 At least 75% of obs. lie within 2 s.d. of mean 1 1/2 2 = 3/4 At least 89% of obs. lie within 3 s.d. of mean 1-1/3 2 = 8/9 Can be applied to all samples no matter how population is distributed What about within one s.d.? 31.5 n = 185 countries mean = 14955, sd = 16243. Fraction 2 4 6 8 1 GDP per capita (PPP), 212 est. 32 Lecture 2, Page 11 of 11