Eco411 Lab: Central Limit Theorem, Normal Distribution, and Journey to Girl State

1. Some students may wonder why the magic number 1.96 or 2 (called a critical value) is so important in statistics. Where does it come from? OK, this is a long, long story. Hopefully I can make it fun and easy to understand.

2. Simply put, the central limit theorem (CLT) implies that the normal distribution plays a key role in statistics, and 1.96 is the 97.5th percentile of the standard normal distribution. Warning! You should read this sentence at least two more times, and think about it, before you go ahead.

3. I want to use simulation (a Monte Carlo experiment) to show the main idea of the CLT, which states that the distribution of the sample average (mean) approaches the normal distribution as the sample size rises, regardless of the distribution of the original data (as long as its variance is finite). In terms of math, no matter which distribution the original y follows, the sample mean ȳ becomes more and more like a normal random variable. This is a strong statement. It is like, no matter who you are, in the end you will like Dr. Li's teaching...

4. To get the Monte Carlo started, I first generate a population of 100,000 observations of a zero-one dummy variable. So by construction the original variable y does not follow a normal distribution. In fact it follows a Bernoulli distribution given as

$$P(y = 1) = p = 0.8, \qquad P(y = 0) = 1 - p = 0.2 \tag{1}$$

where I made up the number 0.8. The expected value and variance are given by

$$E(y) = p = 0.8, \qquad \mathrm{var}(y) = p(1 - p) = 0.16 \tag{2}$$

For those quantitative guys, can you prove the above results?

5. To help you understand, you can think of a fictitious country called Girl State, which exists in a famous Chinese novel called Journey to the West. Girl State has a population of 100,000. This country is special because 80 percent of the population are female (y = 1), while only 20 percent are male (y = 0) [1]. This kind of zero-one variable definitely is not a normal random variable.

[1] Males are always needed, otherwise the country will go extinct in the long run.
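For readers who want to try the proof asked for in item 4, here is one possible sketch of equation (2). It assumes nothing beyond the definition of expectation for a discrete variable and the shortcut formula var(y) = E(y²) − [E(y)]²; the intermediate step E(y²) is added here for illustration.

```latex
\begin{align*}
E(y)   &= 1 \cdot P(y = 1) + 0 \cdot P(y = 0) = p = 0.8, \\
E(y^2) &= 1^2 \cdot p + 0^2 \cdot (1 - p) = p, \\
\mathrm{var}(y) &= E(y^2) - [E(y)]^2 = p - p^2 = p(1 - p) = 0.8 \times 0.2 = 0.16.
\end{align*}
```

The key trick is that y² = y whenever y can only be 0 or 1, so E(y²) equals E(y).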

6. To visualize the non-normality, we draw the histogram for the original data y:

[Figure: histogram of y. Horizontal axis: y, from 0 to 1. Vertical axis: Density.]

This clearly does not look like a bell, or a normal distribution.

7. Next we want to draw 1000 samples, and each sample is small, with only 10 observations. For each sample we can compute a sample mean ȳ, or in this case, the sample proportion of y = 1. Again let me use language everyone understands. It is like the king asks you to visit 1000 small villages, and there are only 10 residents in each village. The king wants to know the gender breakdown for each village.

8. Because each village is small, you expect big variation in ȳ. That is, for a village that has no male, ȳ = 1; for a village that has no female, ȳ = 0. The latter case is unlikely, but still possible. In general, ȳ varies across villages. So ȳ is a random variable [2], whose distribution is called the sampling distribution. Statistics is largely concerned with using ȳ to make inference about p [3].

9. The graph below is the histogram (distribution) of the 1000 ȳ. Remember, we get one ȳ for each village.

[2] You do not know ȳ before you go to a specific village, so it is random.
[3] In reality p is generally unknown, unless we do simulation.

[Figure: histogram of ybar10, the 1000 village means. Horizontal axis: ybar10, from 0.2 to 1. Vertical axis: Density.]

I have several remarks:

(a) Even though y can only take the two values 0 and 1, ȳ can take the values 0, 0.1, 0.2, ..., 1. We see more than two bars in the histogram.

(b) Because 80 percent of the population are female, we have a substantial number of villages that have no male (ȳ = 1, the rightmost bar). On the other hand, a no-female village (ȳ = 0) is hard to find. This can be seen from the height of the bars in the histogram.

(c) The most likely ȳ (the highest bar) is 0.8, which is equal to the population mean p = 0.8.

10. Notice that this second histogram is more symmetric than the first histogram. We can almost see a bell, even though an asymmetric one. The central limit theorem is kicking in now!

11. Also note that the horizontal distance between bars gets smaller. In the limit, when the sample size is infinity, the bars become immediately adjacent, meaning that the normal random variable is continuous (but Bernoulli is discrete).

12. The CLT is an example of asymptotic theory, describing what would happen when the sample gets larger and larger. So next I will let the sample size increase.

13. It is like the king now asks you to check cities rather than villages. So we will increase the sample size from 10 to 100. We visit 1000 cities, and for each city we compute the sample mean ȳ. We then draw the histogram of the 1000 city averages (variable ybar100).
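Remarks (a) to (c) in item 9 can also be checked by hand. As a sketch, assume the 10 residents of a village are independent Bernoulli draws with P(y = 1) = 0.8. Then the number of females k in a village is binomial and ȳ = k/10, so the bar heights in the village histogram should be roughly proportional to:

```latex
\begin{align*}
P(\bar{y} = k/10) &= \binom{10}{k}\, 0.8^{k}\, 0.2^{\,10-k}, \qquad k = 0, 1, \dots, 10, \\
P(\bar{y} = 1)    &= 0.8^{10} \approx 0.107, \\
P(\bar{y} = 0.8)  &= \binom{10}{8}\, 0.8^{8}\, 0.2^{2} \approx 0.302, \\
P(\bar{y} = 0)    &= 0.2^{10} \approx 1.0 \times 10^{-7}.
\end{align*}
```

So among 1000 villages we expect roughly 107 with no male and about 302 with ȳ = 0.8, while a village with no female is practically never seen.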

[Figure: histogram of ybar100, the 1000 city means. Horizontal axis: ybar100, from 0.65 to 0.9. Vertical axis: Density.]

If we ignore the gaps in the graph, we almost see a symmetric bell! Put differently, compared to the village average, the city average is more like a normal random variable. Yes, this is what the central limit theorem is about. Also notice that the dispersion of the third histogram is less than that of the second histogram. Mathematically we can show

$$E(\bar{y}) = p = 0.8, \qquad \mathrm{var}(\bar{y}) = \frac{\mathrm{var}(y)}{n} = \frac{p(1 - p)}{n} = \frac{0.16}{n} \tag{3}$$

The first equation above implies that on average, ȳ is an accurate estimate of p (it is called an unbiased estimator). The second equation indicates that a bigger sample (larger n) leads to smaller variation in ȳ, or equivalently, to a more precise estimate. That is why we prefer a big sample over a small one.

14. Note that var(ȳ) → 0 when n → ∞. So in the limit the sample mean equals the constant p. This is called the Law of Large Numbers. Loosely speaking, a sufficiently large sample can give you a sample mean as close as you like to the actual population mean. The intuition is, when the sample gets larger, it converges to the population, and no wonder the sample mean converges to the population mean.

15. Most importantly, the central limit theorem says

$$\bar{y} \to N\!\left(p, \frac{p(1 - p)}{n}\right) \quad \text{as } n \to \infty \tag{4}$$

where N(·) represents the normal distribution. Pay attention here. It is ȳ, not y, that converges to the normal distribution. In this case our y always remains unchanged as a Bernoulli variable. The CLT is about ȳ.

16. In theory the sample size should be infinitely large for the CLT to hold. In practice we can get ȳ very close to normality for n as small as 20.

17. In order to get a standard normal distribution, with a mean of zero and a variance of one, we apply the process of standardizing (obtaining the z-score). That is, we subtract p, which is E(ȳ), and divide by the square root of var(ȳ), which is called the standard error:

$$\frac{\bar{y} - p}{\sqrt{p(1 - p)/n}} \to N(0, 1) \quad \text{as } n \to \infty \tag{5}$$

18. The histogram of the standardized ȳ is below.

[Figure: histogram of zybar100, the standardized city means. Horizontal axis: zybar100, from -4 to 4. Vertical axis: Density.]

Now we see the value 0 is in the center (0.8 was in the center before standardization).

19. The normal distribution can appear in an unexpected but natural way. Imagine you are looking at the satellite image of a parking lot outside a shopping mall. You will see most cars parked directly in front of the entrance, and the number of cars decreases gradually away from the entrance, just like a normal distribution.

20. Now I can show you where the magic number 1.96 or 2 comes from. After we sort the standardized ȳ in ascending order, 1.96 (or 2) is approximately the 975th observation in the sorted series of 1000 standardized ȳ! In other words

$$P(\text{standard normal} < 1.96) = 0.975 \tag{6}$$

$$P(-1.96 < \text{standard normal} < 1.96) = 0.95 \tag{7}$$

The last inequality is the basis for the confidence interval.

21. A confidence interval allows us to say something almost certain (with 0.95 probability) about something totally random. In other words, the standard normal random variable is random, but it is NOT that random: the most likely values are between -1.96 and 1.96.

22. For a general normal random variable, the most likely values are between the mean ± 1.96 times the standard deviation. Mathematically

$$\text{95 percent confidence interval} = [\mu - 1.96\sigma,\ \mu + 1.96\sigma] \tag{8}$$

23. The Stata code for doing this Monte Carlo is below.

    clear
    set more off
    set obs 100000
    set seed 12345
    capture drop y ybar10 ybar100
    * population: y = 1 with probability 0.8, y = 0 with probability 0.2
    gen y = (runiform() > 0.2)

    * draw 1000 samples; each sample contains 10 obs; draw histogram of the sample means
    gen ybar10 = .
    forvalues i = 1(1)1000 {
        * first and last observation of the i-th block of 10
        local n0 = (`i' - 1)*10 + 1
        local n1 = `n0' + 9
        qui sum y in `n0'/`n1'
        qui replace ybar10 = r(mean) in `i'
    }
    histogram ybar10 in 1/1000
    sum ybar10 in 1/1000, detail

    * draw 1000 samples; each sample contains 100 obs; draw histogram of the sample means
    gen ybar100 = .
    forvalues i = 1(1)1000 {
        * first and last observation of the i-th block of 100
        local n0 = (`i' - 1)*100 + 1
        local n1 = `n0' + 99
        qui sum y in `n0'/`n1'
        qui replace ybar100 = r(mean) in `i'
    }
    histogram ybar100 in 1/1000
    sum ybar100 in 1/1000, detail

    * Where does 1.96 or 2 come from?
    * Answer: standardize ybar100, sort them, and 1.96 (or 2) is the 97.5th percentile
    gen zybar100 = (ybar100 - 0.8)/sqrt(0.16/100)
    histogram zybar100 in 1/1000
    sort zybar100
    list zybar100 in 975

24. Last comment. The funny-looking standardized ȳ has a popular name: it is called the t statistic (t value, t ratio...). I called it panda in class!

$$t \text{ value} = \frac{\bar{y} - p}{\sqrt{p(1 - p)/n}}$$
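The simulated 975th observation from item 23 can be compared with the exact values in equations (6) and (7). The lines below are a small sketch that is not part of the original do-file; they assume a Stata version in which the built-in functions invnormal() and normal() are available, and that zybar100 from item 23 is still in memory.

```stata
* exact 97.5th percentile of the standard normal (about 1.96)
display invnormal(0.975)

* P(standard normal < 1.96), compare with equation (6)
display normal(1.96)

* P(-1.96 < standard normal < 1.96), compare with equation (7)
display normal(1.96) - normal(-1.96)

* simulated counterpart: 97.5th percentile of the 1000 standardized city means
centile zybar100 in 1/1000, centile(97.5)
```

Do not expect the simulated percentile to be exactly 1.96. With n = 100, ȳ moves in steps of 0.01, so the standardized value moves in steps of 0.01/0.04 = 0.25, and the simulation can only land on a nearby grid point such as 1.75 or 2.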