Chapter 18 Summary Sampling Distribution Models

Unit 5: Introduction to Inference
AP Statistics, 2007

What have we learned?

Sample proportions and means will vary from sample to sample; that's sampling error (sampling variability). Sampling variability may be unavoidable, but it is also predictable!

We've learned to describe the behavior of sample proportions when our sample is random and large enough to expect at least 10 successes and failures. We've also learned to describe the behavior of sample means (thanks to the CLT!) when our sample is random (and larger if our data come from a population that's not roughly unimodal and symmetric).

Modeling the Distribution of Sample Proportions

Rather than showing real repeated samples, imagine what would happen if we were to actually draw many samples. Now imagine what would happen if we looked at the sample proportions for these samples. What would the histogram of all the sample proportions look like? We would expect the histogram of the sample proportions to center at the true proportion, p, in the population. As far as the shape of the histogram goes, we can simulate a bunch of random samples that we didn't really draw. It turns out that the histogram is unimodal, symmetric, and centered at p. More specifically, it's an amazing and fortunate fact that a Normal model is just the right one for the histogram of sample proportions.

To use a Normal model, we need to specify its mean and standard deviation. The mean of this particular Normal is at p. When working with proportions, knowing the mean automatically gives us the standard deviation as well: the standard deviation we will use is $\sqrt{pq/n}$, where $q = 1 - p$. So, the distribution of the sample proportions is modeled with a probability model that is $N\left(p, \sqrt{pq/n}\right)$.

A picture of what we just discussed is as follows: [Figure omitted.]

How Good Is the Normal Model?

The Normal model gets better as a good model for the distribution of sample proportions as the sample size gets bigger. Just how big of a sample do we need? This will soon be revealed.
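The claim above, that sample proportions center at p with a spread of $\sqrt{pq/n}$, can be checked by simulation. What follows is a minimal sketch, not part of the original notes: the function name `simulate_sample_proportions` and the values p = 0.3 and n = 200 are assumptions chosen only for illustration.

```python
import math
import random
import statistics


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        # Each individual is a "success" with probability p, independently.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions


if __name__ == "__main__":
    p, n = 0.3, 200  # hypothetical population proportion and sample size
    props = simulate_sample_proportions(p, n, num_samples=5000)
    print("mean of sample proportions:", round(statistics.mean(props), 4))
    print("SD of sample proportions:  ", round(statistics.pstdev(props), 4))
    print("model SD sqrt(pq/n):       ", round(math.sqrt(p * (1 - p) / n), 4))
```

If the Normal model is appropriate, the first printed value should land close to p and the second close to the third. A histogram of `props` would show the unimodal, symmetric shape described above.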

Assumptions and Conditions

Most models are useful only when specific assumptions are true. There are two assumptions in the case of the model for the distribution of sample proportions:
1. The sampled values must be independent of each other.
2. The sample size, n, must be large enough.

Assumptions are hard, often impossible, to check. That's why we assume them. Still, we need to check whether the assumptions are reasonable by checking conditions that provide information about the assumptions. The corresponding conditions to check before using the Normal to model the distribution of sample proportions are the 10% Condition and the Success/Failure Condition.
1. 10% condition: If sampling has not been made with replacement, then the sample size, n, must be no larger than 10% of the population.
2. Success/failure condition: The sample size has to be big enough so that the expected numbers of successes and failures, $np$ and $nq$, are both at least 10.

So, we need a large enough sample that is not too large.

A Sampling Distribution Model for a Proportion

A proportion is no longer just a computation from a set of data.
  o It is now a random quantity that has a distribution.
  o This distribution is called the sampling distribution model for proportions.
Even though we depend on sampling distribution models, we never actually get to see them.
  o We never actually take repeated samples from the same population and make a histogram. We only imagine or simulate them.
Still, sampling distribution models are important because
  o they act as a bridge from the real world of data to the imaginary world of the statistic and
  o enable us to say something about the population when all we have is data from the real world.
Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of $\hat{p}$ is modeled by a Normal model with
  o Mean: $\mu(\hat{p}) = p$
  o Standard deviation: $SD(\hat{p}) = \sqrt{pq/n}$

What About Quantitative Data?

Proportions summarize categorical variables. The Normal sampling distribution model looks like it will be very useful. Can we do something similar with quantitative data?
We can indeed. Even more remarkable, not only can we use all of the same concepts, but almost the same model.

Simulating the Sampling Distribution of a Mean

Like any statistic computed from a random sample, a sample mean also has a sampling distribution. We can use simulation to get a sense as to what the sampling distribution of the sample mean might look like.

Means: The Average of One Die

Let's start with a simulation of 10,000 tosses of a die, then look at the average of two, three, 5, and 20 dice, each after a simulation of 10,000 tosses.

[Figures omitted: histograms of the simulated results for one die and for the averages of two, three, 5, and 20 dice.]

Means: What the Simulations Show

As the sample size (number of dice) gets larger, each sample average is more likely to be closer to the population mean.
  o So, we see the shape continuing to tighten around 3.5.
And, it probably does not shock you that the sampling distribution of a mean becomes Normal.

The Fundamental Theorem of Statistics

The sampling distribution of any mean becomes more nearly Normal as the sample size grows.
  o All we need is for the observations to be independent and collected with randomization.
  o We don't even care about the shape of the population distribution!
The Fundamental Theorem of Statistics is called the Central Limit Theorem (CLT).
The CLT is surprising and a bit weird:
  o Not only does the histogram of the sample means get closer and closer to the Normal model as the sample size grows, but this is true regardless of the shape of the population distribution.
The CLT works better (and faster) the closer the population model is to a Normal itself. It also works better for larger samples.

The Central Limit Theorem (CLT): The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal model. The larger the sample, the better the approximation will be.
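The dice simulations described above can be reproduced with a short script. This is a sketch under stated assumptions rather than the code behind the original figures: the function name `simulate_dice_averages` is hypothetical, and the comparison value $\sqrt{35/12}/\sqrt{k}$ uses the known variance of one fair die, 35/12.

```python
import math
import random
import statistics


def simulate_dice_averages(num_dice, num_trials=10_000, seed=0):
    """Return num_trials averages, each computed from num_dice fair-die rolls."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [
        sum(rng.randint(1, 6) for _ in range(num_dice)) / num_dice
        for _ in range(num_trials)
    ]


if __name__ == "__main__":
    die_sd = math.sqrt(35 / 12)  # population SD of a single fair die
    for k in (1, 2, 3, 5, 20):
        averages = simulate_dice_averages(k)
        print(
            f"{k:>2} dice: mean = {statistics.mean(averages):.3f}, "
            f"SD = {statistics.pstdev(averages):.3f}, "
            f"sigma/sqrt(n) = {die_sd / math.sqrt(k):.3f}"
        )
```

The mean should stay near 3.5 for every row while the SD shrinks as more dice are averaged, which is the tightening the histograms show.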

But Which Normal?

The CLT says that the sampling distribution of any mean or proportion is approximately Normal. But which Normal model?
  o For proportions, the sampling distribution is centered at the population proportion.
  o For means, it's centered at the population mean.
But what about the standard deviations?
The Normal model for the sampling distribution of the mean has a standard deviation equal to $SD(\bar{y}) = \sigma/\sqrt{n}$, where σ is the population standard deviation.
The Normal model for the sampling distribution of the proportion has a standard deviation equal to $SD(\hat{p}) = \sqrt{pq/n}$.

Assumptions and Conditions

The CLT requires remarkably few assumptions, so there are few conditions to check:
1. Random Sampling Condition: The data values must be sampled randomly or the concept of a sampling distribution makes no sense.
2. Independence Assumption: The sample values must be mutually independent. (When the sample is drawn without replacement, check the 10% condition.)
3. Large Enough Sample Condition: There is no one-size-fits-all rule.

Diminishing Returns

The standard deviation of the sampling distribution declines only with the square root of the sample size. While we'd always like a larger sample, the square root limits how much we can make a sample tell about the population. (This is an example of the Law of Diminishing Returns.)

Standard Error

Both of the sampling distributions we've looked at are Normal.
  o For proportions, $SD(\hat{p}) = \sqrt{pq/n}$
  o For means, $SD(\bar{y}) = \sigma/\sqrt{n}$
When we don't know p or σ, we're stuck, right?
  o Nope. We will use sample statistics to estimate these population parameters.
  o Whenever we estimate the standard deviation of a sampling distribution, we call it a standard error.
For a sample proportion, the standard error is $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$.
For the sample mean, the standard error is $SE(\bar{y}) = s/\sqrt{n}$.

Sampling Distribution Models

Always remember that the statistic itself is a random quantity.
  o We can't know what our statistic will be because it comes from a random sample.
Fortunately, for the mean and proportion, the CLT tells us that we can model their sampling distribution directly with a Normal model.
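The two standard error formulas, and the diminishing returns from larger samples, can be made concrete with a small calculation. This is a minimal sketch: the helper names `se_proportion` and `se_mean` and all the numbers are assumptions invented for illustration, not data from the notes.

```python
import math
import statistics


def se_proportion(successes, n):
    """Standard error of a sample proportion: sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    return math.sqrt(p_hat * (1 - p_hat) / n)


def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n), with s the sample SD."""
    return statistics.stdev(values) / math.sqrt(len(values))


if __name__ == "__main__":
    # Diminishing returns: quadrupling n only halves the standard error.
    for n in (100, 400, 1600):
        print(f"n = {n:>4}: SE(p_hat) = {se_proportion(n // 2, n):.4f}")

    sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up quantitative data
    print(f"SE(y_bar) = {se_mean(sample):.4f}")
```

With half of each sample counted as successes, the loop shows the standard error dropping from 0.05 to 0.025 to 0.0125 as n goes from 100 to 400 to 1600: sixteen times the data buys only a fourfold gain in precision.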

There are two basic truths about sampling distributions:
1. Sampling distributions arise because samples vary. Each random sample will have different cases and, so, a different value of the statistic.
2. Although we can always simulate a sampling distribution, the Central Limit Theorem saves us the trouble for means and proportions.

The Process Going Into the Sampling Distribution Model

[Figure omitted.]

What Can Go Wrong?

Don't confuse the sampling distribution with the distribution of the sample.
  o When you take a sample, you look at the distribution of the values, usually with a histogram, and you may calculate summary statistics.
  o The sampling distribution is an imaginary collection of the values that a statistic might have taken for all random samples: the one you got and the ones you didn't get.
Beware of observations that are not independent.
  o The CLT depends crucially on the assumption of independence.
  o You can't check this with your data; you have to think about how the data were gathered.
Watch out for small samples from skewed populations.
  o The more skewed the distribution, the larger the sample size we need for the CLT to work.
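The warning about small samples from skewed populations can be illustrated by simulation. The sketch below assumes an exponential population, a strongly right-skewed model chosen here only as an example, and measures how lopsided the simulated sample means remain at several sample sizes. The helpers `skewness` and `sample_means` are hypothetical names, not from the notes.

```python
import random
import statistics


def skewness(values):
    """Moment-based skewness: 0 for a symmetric shape, positive for a right skew."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


def sample_means(n, num_samples=5000, seed=0):
    """Means of num_samples random samples of size n from an exponential population."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(num_samples)
    ]


if __name__ == "__main__":
    for n in (2, 10, 50):
        print(f"n = {n:>2}: skewness of sample means = {skewness(sample_means(n)):.2f}")
```

The skewness of the sample means should shrink toward 0 as n grows, but for small n it stays clearly positive, so a Normal model would be a poor fit there.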