CENTRAL LIMIT THEOREM (CLT)

Sophie Williamson
6 years ago

A sampling distribution is the probability distribution of a sample statistic that is formed when samples of size n are repeatedly taken from a population. If the sample statistic is the sample mean, then the distribution is the sampling distribution of sample means.

CLT in layman's terms: if we collect samples of the same size (n) from the same population, compute the mean of each sample, and then draw a histogram of those means, the histogram will have approximately the bell shape of the normal distribution. This will be true regardless of the shape of the original population's distribution. The approximation gets better as we take larger samples. The mean of the sample means will be the same as the mean of the original population, and the standard deviation of the sample means will be the original standard deviation divided by the square root of the sample size.

CLT:
1. If samples of size n, where n ≥ 30, are drawn from any population with a mean µ and a standard deviation σ, then the sampling distribution of sample means approximates a normal distribution. The greater the sample size, the better the approximation.
2. If the original population is normally distributed, the sampling distribution of sample means is normally distributed for any sample size n.
3. In either case, the sampling distribution of sample means has mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \sigma / \sqrt{n}$.

CONFIDENCE INTERVALS FOR THE POPULATION MEAN (large samples)

Inferential statistics uses sample statistics to estimate the value of an unknown population parameter. A point estimate is a single-value estimate for a population parameter. The most unbiased point estimate of the population mean µ is the sample mean $\bar{x}$. An interval estimate is an interval, or range of values, used to estimate a population parameter.

Although you can assume that the point estimate (sample mean) is not exactly equal to the actual population mean, it is probably close to it. In other words, each sample mean has some small error associated with it in comparison to the population mean. Almost no sample mean is perfectly representative of the population. To form an interval estimate, use the point estimate as the center of the interval, then add and subtract a margin of error.
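The CLT statements earlier in this section can be checked with a small simulation. The sketch below is only an illustration under assumed settings that are not part of the notes: the population is exponential with rate 1 (so µ = 1 and σ = 1), the sample size is n = 36, and 5000 samples are drawn with a fixed seed. The helper name `sample_means` is made up for this example.

```python
import random
import statistics

def sample_means(draw, n, num_samples, rng):
    """Return the means of num_samples samples, each of size n."""
    return [
        statistics.fmean([draw(rng) for _ in range(n)])
        for _ in range(num_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed population: exponential with rate 1 (strongly skewed), mu = 1, sigma = 1.
def draw_exponential(r):
    return r.expovariate(1.0)

n = 36
means = sample_means(draw_exponential, n, 5000, rng)

mean_of_means = statistics.fmean(means)   # should be close to mu = 1
sd_of_means = statistics.stdev(means)     # should be close to sigma / sqrt(n) = 1/6

# Share of sample means within one standard deviation of their mean;
# for a normal shape this share is about 0.68.
within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in means
) / len(means)

print(mean_of_means, sd_of_means, within_one_sd)
```

Even though the assumed population is far from bell shaped, the sample means should cluster around µ with a spread near σ/√n, which is what the CLT describes.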

A c-confidence interval for the population mean µ is $\bar{x} - E < \mu < \bar{x} + E$, where c, the level of confidence, is the probability that the confidence interval contains µ. If c = 90%, there is a probability of .90 that this interval contains µ.

E is the margin of error of the point estimate, based on how confident you want to be that your interval contains µ. E is also called the maximum error of the estimate: at confidence level c, it is the largest likely difference between the sample mean and the true population mean.

$E = z_c \frac{\sigma}{\sqrt{n}}$

When n ≥ 30, the sample standard deviation s can be used in place of σ.

The level of confidence, c, is the area under the standard normal curve between the critical values, $-z_c$ and $z_c$. The LEVEL OF CONFIDENCE is the probability (or percentage) that the confidence interval contains the true value of the population parameter. A CRITICAL VALUE is the number on the z line separating z scores that are likely to occur from those that are unlikely to occur. There is an area (probability) of c = 1 - α between the vertical borderlines at $-z_{\alpha/2}$ and $z_{\alpha/2}$. Common choices for the degree of confidence are 90% (α = .10), 95% (α = .05), and 99% (α = .01).

$z_c = z_{\alpha/2}$ is the positive z value at the vertical boundary separating an area of α/2 in the right tail of the standard normal distribution. The value $-z_{\alpha/2}$ is at the vertical boundary for the area of α/2 in the left tail.

$z_{.05} = 1.645$ is the positive critical value for 90% confidence intervals.
$z_{.025} = 1.96$ is the positive critical value for 95% confidence intervals.
$z_{.005} = 2.576$ is the positive critical value for 99% confidence intervals.

CONFIDENCE INTERVALS FOR THE MEAN (small samples)

In the CLT, we learned that as long as n ≥ 30 we could assume that the sampling distribution of sample means was approximately normal. It didn't even matter what the distribution of the original population was. On the other hand, we learned that if the original population was normally distributed, then the sampling distribution of sample means would be normal regardless of how large n was. However, all of this assumed we knew the population standard deviation. As we saw when we did confidence intervals, the population standard deviation is usually not known. When we did confidence intervals for large samples, we substituted the sample standard deviation s for the population standard deviation σ. It was a decent substitution because n ≥ 30.

In this section, the samples are small (n < 30). This is not a problem as long as the original population is normally distributed and the population standard deviation is known. In that case, the sample means are normally distributed and you can construct the confidence intervals exactly as you did for large samples, with $E = z_c \frac{\sigma}{\sqrt{n}}$. However, again it is seldom the case that you know the population standard deviation. Unfortunately, the sample standard deviation is not as good a substitute for the population standard deviation now that the sample is small (n < 30). We will still use s as a substitute for σ, but we will compensate for the weaker substitution by using the t-distribution in place of the standard normal distribution. Also, remember that the original population must be normally distributed since we are dealing with small samples. If the original population is not normally distributed, we cannot construct confidence intervals for small samples with these methods.
So, the Student t-distribution is used for
a) small samples, when you know that
b) the original population is normally distributed, and
c) you don't know σ.
Under these conditions, we say that the sample means have a t-distribution, where the t score of any sample mean is given by $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$.
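To see why the extra compensation matters for small samples, the following sketch estimates how often the interval "sample mean plus or minus critical value times s/√n" actually contains µ. Everything specific here is an assumption for illustration: a normal population with µ = 0 and σ = 1, samples of size n = 5, 20000 trials, and a fixed seed. The value 2.776 is the standard t-table critical value for 95% confidence with d.f. = 4; the function name `coverage` is hypothetical.

```python
import random
import statistics
from statistics import NormalDist

def coverage(n, crit, trials, rng, mu=0.0, sigma=1.0):
    """Fraction of intervals x_bar +/- crit * s / sqrt(n) that contain mu."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        s = statistics.stdev(sample)      # sample standard deviation
        margin = crit * s / n ** 0.5
        if x_bar - margin < mu < x_bar + margin:
            hits += 1
    return hits / trials

Z_CRIT = NormalDist().inv_cdf(0.975)  # about 1.96, the 95% z critical value
T_CRIT_DF4 = 2.776                    # 95% t critical value for d.f. = 4 (from a t-table)

rng = random.Random(1)  # fixed seed so the run is repeatable
cover_z = coverage(5, Z_CRIT, 20000, rng)      # s with a z critical value
cover_t = coverage(5, T_CRIT_DF4, 20000, rng)  # s with a t critical value

print(cover_z, cover_t)
```

Under these assumptions the z-based intervals should contain µ noticeably less often than 95% of the time, while the t-based intervals should come out close to 95%.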

The t-distribution is bell shaped and symmetric about the mean. The total area under a t-curve is 1. The mean, median, and mode of the t-distribution are equal to zero. All of these characteristics are shared with the standard normal distribution. But the t-distribution is a family of curves, each determined by a parameter called the degrees of freedom. When you use a t-distribution to estimate a population mean, the degrees of freedom are equal to one less than the sample size. So, d.f. = n - 1.

HYPOTHESIS TESTING

Let's suppose that we are out in the old west in a saloon, drinking and gambling with notorious cowboys. One day we decide to gamble on whether a coin flip results in a head or a tail. After playing for a while and losing considerable money, we hit upon the idea that the coin, which was offered by one unscrupulous cowboy, may not be fair. Should we bring up this question? In all likelihood, if we accuse him of using an unfair coin and we are proven wrong, we will be strung up. However, if we continue to play and the coin is indeed unfair, we are going to lose all our money. (Remember these two possibilities later.)

In order to prove (statistically) whether or not the coin is fair, the local sheriff decides to conduct an experiment. He will flip the coin 100 times and record the results. Let x represent the number of heads that Sheriff Lobo gets when he flips the coin 100 times. Note that the distribution of x is a binomial distribution, since each flip results in either a head or a tail. Also, n = 100. But what is p, the probability of getting a head on any one flip? That's what is really in question here. The unscrupulous-looking cowboy who provided the coin claims that p = .5 (µ = 50) and we are claiming that p ≠ .5 (µ ≠ 50). In order to calculate certain probabilities, we must have a p to work with. So, we will assume that p = .5 (µ = 50). Then we will try to disprove it. This is an important point. In hypothesis testing, we always start by assuming that a parameter equals a specific value, rather than that it does not equal that value, regardless of what we are trying to prove.

Getting back to our distribution, we now can say that µ = np = 100(.5) = 50 and $\sigma = \sqrt{npq} = \sqrt{100(.5)(.5)} = 5$, assuming the coin is fair. Also, note that P(x = 50) = .0796, P(49) = .0780 = P(51), P(48) = .0735 = P(52), P(47) = .0666 = P(53). These results come from the binomial formula $P(x) = {}_nC_x \, p^x q^{n-x}$. It is interesting to note that the histogram of these binomial probabilities is approximately normal.
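The binomial probabilities quoted above can be reproduced directly from the binomial formula. This is a minimal sketch using only the Python standard library; the function name `binom_pmf` is chosen for this example.

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    """P(x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

n, p = 100, 0.5                 # 100 flips of a coin assumed to be fair
mu = n * p                      # mean number of heads
sigma = sqrt(n * p * (1 - p))   # standard deviation of the number of heads

# Probabilities for the head counts discussed in the text.
probs = {x: binom_pmf(x, n, p) for x in range(47, 54)}
for x, prob in probs.items():
    print(x, round(prob, 4))
```

Because p = .5, the probabilities are symmetric about 50, which is why P(49) = P(51), P(48) = P(52), and so on.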

These are important numbers to look at. They should convince you not to shoot the cowboy if the sheriff gets, say, 49 heads out of 100 flips. The probability of getting exactly 50 heads is very small and only slightly larger than the probability of getting 49 or 51 heads. In other words, it is almost as probable that a fair coin would result in 49 heads as 50 heads. So, this brings up the most important question: how do you use this experiment to decide whether a coin is fair or not? Getting 49 heads is obviously not enough reason to pull out your six-shooter. According to the histogram, 30 heads out of 100 would certainly be enough reason. Also, 70 heads looks like a reason to pump some lead into our fellow gambler. But are 30 and 70 the right limits?

Let's suppose also that you have only two choices:
1. Decide statistically that the coin is unfair and accuse him of cheating, or
2. Decide statistically that the coin is fair and continue to gamble your hard-earned money.

You have to set the limits of the decision-making process based on these two choices and their possible consequences. You have to understand that this test is not conclusive evidence. It only tells you what is probably true. The test could mislead you. It is possible to flip a fair coin 100 times and get only 10 heads. It's possible, but not probable. So here are the consequences if your statistics mislead you and the improbable but possible does occur:
I. If your statistics tell you to reject the coin and call him a cheater when it is actually a fair coin, he will probably shoot you.
II. If your statistics tell you to accept the coin and continue gambling when it is actually an unfair coin, then you will lose all your money (but not your life).

So you must set your limits so as to balance these two possible mistakes. Obviously the first mistake is more critical. It's not good to make the second mistake, but it's not nearly as costly.

It turns out that setting our limits at 30 (4σ from 50) and 70 (4σ from 50) would cause us to make the second mistake too often and we would lose all our money. We would often continue to gamble with an unfair coin. It turns out that setting our limits at 45 (1σ from 50) and 55 (1σ from 50) would cause us to make the first mistake too often and we would get shot. We would often accuse an innocent man of cheating. The best tradeoff probably occurs at x = 40 (2σ from 50) and x = 60 (2σ from 50). P(40 ≤ x ≤ 60) ≈ .95 and P(x < 40 or x > 60) ≈ .05. If the coin is indeed fair, we have about a 95% chance of correctly concluding that it is fair.
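The three candidate pairs of limits can be compared by summing exact binomial probabilities for a fair coin. This is a sketch under the text's assumption p = .5 and n = 100; the helper names `binom_pmf` and `prob_between` are made up for this example.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_between(lo, hi, n=100, p=0.5):
    """P(lo <= x <= hi) for the number of heads x in n flips."""
    return sum(binom_pmf(x, n, p) for x in range(lo, hi + 1))

# Chance that a fair coin lands inside each pair of limits.
p_30_70 = prob_between(30, 70)   # limits 4 sigma from 50
p_40_60 = prob_between(40, 60)   # limits 2 sigma from 50
p_45_55 = prob_between(45, 55)   # limits 1 sigma from 50

for label, prob in (("30-70", p_30_70), ("40-60", p_40_60), ("45-55", p_45_55)):
    # 1 - prob is the chance of wrongly accusing the owner of a fair coin.
    print(label, round(prob, 4), round(1 - prob, 4))
```

The exact value for the 40 to 60 limits comes out a little above the rounded .95 used in the text, because the binomial sum includes both endpoints; the normal-curve figure of about 95% within 2σ is an approximation.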

With the limits at 40 and 60, we also have only about a 5% chance of calling the coin unfair when it is actually fair. So, now let's actually have the sheriff perform the test. Let's imagine two different outcomes.
1. The sheriff flips the coin 100 times and gets 42 heads. Based on our limits, we decide the coin is probably fair and we continue to gamble. In other words, there is not sufficient sample evidence to support the claim that the coin is unfair. We could be making a Type II error.
2. The sheriff flips the coin 100 times and gets 38 heads. Based on our limits, we decide the coin is probably unfair and accuse the cowboy of cheating. In other words, the sample data support the claim that the coin is unfair. We could be making a Type I error.

Hypothesis Testing Definitions:
1. SYMBOLIC CLAIM - the claim put into mathematical symbols
2. NULL HYPOTHESIS - $H_0$, hypothesis about what µ is, must contain the condition of equality
3. ALTERNATIVE HYPOTHESIS - $H_a$, hypothesis about what µ isn't, must not contain the condition of equality
4. TYPE I ERROR - mistake of rejecting $H_0$ when it is actually true
5. SIGNIFICANCE LEVEL - α, probability of a Type I error, total area of the tails
6. TAIL - critical region, region of the standard normal curve containing values that would conflict significantly with $H_0$
7. CRITICAL VALUE - CV, z score which sets the vertical boundary of the tail
8. TEST STATISTIC - TS, z score of our sample mean
9. TYPE II ERROR - mistake of failing to reject (FTR) $H_0$ when it is actually false
10. β - probability of a Type II error

In hypothesis testing, it is easy to get lost in the mechanics of the test and lose sight of what is really happening. All we are really doing is comparing an actual sample mean to a proposed population mean. You should recognize that the test statistic is just a z score representing the number of standard deviations that our sample mean is away from the claimed population mean µ in the null hypothesis.

If the sample mean and the claimed population mean are not significantly different, we begin to feel that the claimed value of µ in the null hypothesis is probably the true value. The TS doesn't fall in the critical region and we fail to reject the null hypothesis (FTR $H_0$). But if our sample mean is very far away from the claimed value of µ in the null hypothesis, then this unusual result leads us to believe that the claimed value of µ in the null hypothesis is incorrect. In this case, the sample mean and the claimed population mean are significantly different. The TS does fall in the critical region and we reject $H_0$.
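As a worked illustration, the two coin outcomes from the sheriff's experiment can be turned into test statistics. This assumes the normal approximation to the number of heads (no continuity correction) and a two-tailed test at α = .05, so the critical values are ±1.96; those choices are assumptions made for this example, close to but not identical with the limits of 40 and 60 used above.

```latex
z = \frac{x - \mu}{\sigma}, \qquad \mu = np = 50, \qquad \sigma = \sqrt{npq} = 5

x = 42: \quad z = \frac{42 - 50}{5} = -1.6, \qquad |z| < 1.96 \;\Rightarrow\; \text{fail to reject } H_0

x = 38: \quad z = \frac{38 - 50}{5} = -2.4, \qquad |z| > 1.96 \;\Rightarrow\; \text{reject } H_0
```

Both conclusions agree with the decisions reached from the limits: 42 heads is not far enough from 50 to be significant, while 38 heads is.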

Hypothesis Testing Steps:
1. State the claim in symbolic form. Identify $H_0$ and $H_a$. Remember that $H_0$ must contain the condition of equality and $H_a$ must not.
2. Specify the significance level and the sample statistics. These are always given.
3. Draw a graph and include µ from $H_0$ on the centerline of your graph!
4. Identify whether the test is left-tailed, right-tailed, or two-tailed. Remember that the sign in $H_a$ points to the tail. If it is a two-tailed test, remember to divide α equally between the two tails.
5. Use the significance level α to determine the critical value(s) (CV). Shade the critical region and include the CV on your graph.
6. Determine the test statistic (TS) and include it on your graph.
7. Reject $H_0$ if the TS is in the critical region. Fail to reject $H_0$ if the TS is not in the critical region.
8. Restate the decision in step 7 in simple, nontechnical terms using the flow chart.
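Steps 4 through 7 above can be sketched as a small function. This is only one possible arrangement, assuming a z-test for a mean with known (or large-sample) standard deviation; the function name `z_test`, its parameters, and the sample numbers in the usage lines are hypothetical and do not come from the notes.

```python
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, alpha, tail):
    """Return (test statistic, critical values, reject H0?) for a z-test of a mean.

    tail is "left", "right", or "two", matching the direction of the sign in Ha.
    """
    std_normal = NormalDist()                 # standard normal curve
    ts = (x_bar - mu0) / (sigma / n ** 0.5)   # step 6: test statistic
    if tail == "two":
        cv = std_normal.inv_cdf(1 - alpha / 2)  # alpha split between two tails
        return ts, (-cv, cv), abs(ts) > cv
    if tail == "left":
        cv = std_normal.inv_cdf(alpha)          # all of alpha in the left tail
        return ts, (cv,), ts < cv
    if tail == "right":
        cv = std_normal.inv_cdf(1 - alpha)      # all of alpha in the right tail
        return ts, (cv,), ts > cv
    raise ValueError("tail must be 'left', 'right', or 'two'")

# Hypothetical data: H0: mu = 50, Ha: mu != 50, x_bar = 52.1, sigma = 6, n = 36.
ts_two, cvs_two, reject_two = z_test(52.1, 50, 6, 36, 0.05, "two")

# Hypothetical data: H0: mu >= 50, Ha: mu < 50, x_bar = 49.0, same sigma and n.
ts_left, cvs_left, reject_left = z_test(49.0, 50, 6, 36, 0.05, "left")

print(ts_two, cvs_two, reject_two)
print(ts_left, cvs_left, reject_left)
```

The function only automates the arithmetic; stating the hypotheses (step 1), drawing the graph (step 3), and restating the decision in plain language (step 8) still have to be done by hand.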
