ACMS 20340 Statistics for Life Sciences Chapter 13: Sampling Distributions
Sampling We use information from a sample to infer something about a population. When using random samples and randomized experiments, we cannot rule out the possibility of incorrect inferences. So we ask: How often would this method give a correct answer if we used it a large number of times?
Some Terminology A parameter is a number which describes some aspect of a population. In practice, we don't know the value of a parameter because we cannot directly examine/measure the entire population. A statistic is a number that can be computed from the sample data, without making use of any unknown parameters. In practice we often use statistics to estimate an unknown parameter.
Mnemonic Device Statistics come from Samples. Parameters come from Populations.
An Illustration According to the 2008 Health and Nutrition Examination Survey, the mean weight of the sample of American adult males was x = 191.5 pounds. 191.5 is a statistic. The population: all American adult males over the age of 20. The parameter: the mean weight of all the members of the population.
On Means We will always use µ to represent the mean of a population. This is a fixed parameter that is unknown when we use a sample for inference. We will always write x for the mean of the sample. This is the average of the observations in the sample.
The Key Question If the sample mean x is rarely exactly equal to the population mean µ and can vary from sample to sample, how can we consider it a reasonable estimate of µ?
The Answer... If we take larger and larger samples, the statistic x is guaranteed to get closer and closer to the parameter µ. This fact is known as the Law of Large Numbers.
The Law of Large Numbers 1 Recall: In the long run, the proportion of occurrences of a given outcome gets closer and closer to the probability of that outcome. E.g. the proportion of heads when tossing a fair coin gets closer to 1/2 in the long run. Similarly, in the long run, the average outcome gets close to the population mean.
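The coin-toss illustration is easy to simulate. The sketch below (Python, not part of the original slides) tracks the proportion of heads as the number of tosses grows.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

def running_proportion(n_tosses):
    """Toss a fair coin n_tosses times and return the proportion of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The proportion wanders for small n but settles near 1/2 in the long run.
for n in (10, 1_000, 100_000):
    print(n, running_proportion(n))
```

The printed proportions bounce around for 10 tosses but hug 0.5 by 100,000 tosses, exactly as the Law of Large Numbers predicts.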
The Law of Large Numbers 2 Using the basic laws of probability, we can prove the law of large numbers. The Law of Large Numbers applet is useful for illustrating the law.
A Word of Caution Only in the very long run does the sample mean get really close to the population mean, and so in this respect, the Law of Large Numbers is not very practical. However, the success of certain businesses, such as casinos and insurance companies, depends on the Law of Large Numbers.
Sampling Distributions 1 The Law of Large Numbers says that if we measure enough subjects, the statistic x will eventually get close to the parameter µ. What if we can only take samples of a smaller size, say 10?
Sampling Distributions 2 What would happen if we took many samples of 10 subjects from this population? To answer this question: take a large number of samples of size 10 from the population; calculate the sample mean x for each sample; make a histogram of the values of x; examine the distribution in the histogram (shape, center, spread, outliers, etc.).
By Way of Example... 1 High levels of dimethyl sulfide (DMS) in wine cause the wine to smell bad. Winemakers are thus interested in determining the odor threshold, the lowest concentration of DMS that the human nose can detect. The threshold varies from person to person, so we'd like to find the mean threshold µ in the population of all adults. An SRS of size 10 yields the values 28 40 28 33 20 31 29 27 17 21 and thus we have a sample mean x = 27.4.
By Way of Example... 2 It turns out that the DMS odor threshold of adults follows a roughly Normal distribution with µ = 25 mg/l and standard deviation σ = 7 mg/l. By following the procedure outlined before (taking 1,000 SRSs), we produce a histogram that displays the distribution of the values of x from the 1,000 SRSs. This histogram displays the sampling distribution of the statistic x.
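This simulation is simple to reproduce. The sketch below (Python, standard library only; the population parameters µ = 25 and σ = 7 are the ones given on the slide) draws 1,000 SRSs of size 10 and summarizes the resulting sample means.

```python
import random
import statistics

random.seed(0)
MU, SIGMA, N, REPS = 25, 7, 10, 1000  # population mean/sd, sample size, number of SRSs

# One sample mean per SRS: these values approximate the sampling distribution of x.
sample_means = [
    statistics.mean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(REPS)
]

print(round(statistics.mean(sample_means), 2))   # near mu = 25
print(round(statistics.stdev(sample_means), 2))  # near sigma/sqrt(10), about 2.21
```

A histogram of sample_means would reproduce the figure described on the slides.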
By Way of Example... 3 [Figure: histogram of the 1,000 sample means, approximating the sampling distribution of x]
The Official Definition The sampling distribution of a statistic is the distribution of values taken by the statistic over all possible samples of some fixed size from the population. Thus, the histogram on the previous slide actually displays an approximation to the sampling distribution of the statistic x. Important point: The sample mean is a random variable! Since good samples are chosen randomly, statistics such as the sample mean x are random variables. Thus we can describe the behavior of a sample statistic by means of a probability model.
An Important Difference The law of large numbers describes what would happen if we took random samples of increasing size n. A sampling distribution describes what would happen if we took all random samples of a fixed size n.
Examining the Sampling Distribution Shape: it appears to be Normal. Center: the mean of the 1,000 values of x is 24.95, very close to the population mean µ = 25. Spread: the s.d. of the 1,000 values of x is 2.217, much smaller than the population s.d. σ = 7.
A General Fact When we choose many SRSs from a population, the sampling distribution of the sample means is centered at the mean of the original population. But the sampling distribution is also less spread out than the distribution of individual observations.
More Precisely Suppose that x is the mean of an SRS of size n drawn from a large population with mean µ and standard deviation σ. Then the sampling distribution of x has mean µ_x = µ and standard deviation σ_x = σ/√n. The subscript notation simply distinguishes the sampling distribution from the population distribution. Because the mean of the sampling distribution of the statistic x, µ_x, is equal to µ, we say that the statistic x is an unbiased estimator of the parameter µ.
Unbiased Estimators An unbiased estimator is correct on the average over many samples. Just how close the estimator will be to the parameter in most samples is determined by the spread of the sampling distribution. If the individual observations have s.d. σ, then sample means x from samples of size n have s.d. σ/√n. Thus, averages are less variable than individual observations.
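The σ/√n formula makes "averages are less variable" concrete. A quick calculation (Python sketch, reusing the DMS value σ = 7 from the earlier example) shows how the spread shrinks as n grows.

```python
import math

sigma = 7  # population standard deviation (DMS example)

# Standard deviation of the sample mean for increasing sample sizes.
for n in (1, 10, 25, 100):
    print(n, round(sigma / math.sqrt(n), 3))
```

Note that quadrupling the sample size only halves the standard deviation of x.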
For a Normal Population If individual observations have the distribution N(µ, σ), then the sample mean x of an SRS of size n has the distribution N(µ, σ/√n).
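This fact lets us compare an individual observation with a sample mean directly. The sketch below (Python's statistics.NormalDist; the parameters µ = 25, σ = 7, n = 10 come from the DMS example, while the 32 mg/l cutoff is just an illustrative value) contrasts the chance that one adult's threshold exceeds 32 with the chance that the mean of an SRS of 10 does.

```python
import math
from statistics import NormalDist

mu, sigma, n = 25, 7, 10

individual = NormalDist(mu, sigma)                  # distribution of one observation
sample_mean = NormalDist(mu, sigma / math.sqrt(n))  # distribution of x for n = 10

p_ind = 1 - individual.cdf(32)    # P(one threshold > 32): about 0.16
p_mean = 1 - sample_mean.cdf(32)  # P(sample mean > 32): far smaller
print(round(p_ind, 4), round(p_mean, 4))
```

The sample mean is much less likely to land far from µ than a single observation is, which is the practical payoff of the σ/√n spread.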
Seeing is Believing
Non-Normal Distributions? We know what the values of the mean and standard deviation of x will be, regardless of the population distribution. But what can be known about the shape of the sampling distribution? If the population distribution is Normal, the sampling distribution is Normal. If the population distribution is not Normal, the sampling distribution is ?????
Central Limit Theorem Remarkably, as the size of samples drawn from a non-Normal population increases, the sampling distribution of x changes shape: it starts to look more and more like a Normal distribution, regardless of what the population distribution looks like. This fact is the Central Limit Theorem.
The Official Definition Draw an SRS of size n from any population with mean µ and standard deviation σ. When n is large, the sampling distribution of the sample mean x is approximately Normal: x is a random variable with distribution (roughly) N(µ, σ/√n).
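The theorem can be watched in action with a strongly skewed population. The sketch below (Python; an exponential population with mean 1 and sd 1, chosen purely for illustration) draws repeated samples and shows the sampling distribution centering at µ = 1 with spread shrinking like 1/√n.

```python
import random
import statistics

random.seed(2)

def sample_means(n, reps=2000):
    """Means of `reps` SRSs of size n from an exponential population (mean 1, sd 1)."""
    return [statistics.mean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

for n in (2, 10, 25):
    means = sample_means(n)
    # Center stays near mu = 1; spread shrinks like 1/sqrt(n).
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of the means for each n would reproduce the panels described on the "Central Limit in Action" slide: skewed for n = 2, nearly Normal by n = 25.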
So Why Do We Care? The Central Limit Theorem allows us to use Normal probability calculations to answer questions about sample means, even if the population distribution is not Normal.
Central Limit in Action (a) Strongly skewed population distribution. (b) Sampling distribution of x with n = 2. (c) Sampling distribution of x with n = 10. (d) Sampling distribution of x with n = 25.
Warning! The CLT applies to sampling distributions, not the distribution of a sample. "Now I'm confused." A larger sample size does NOT make the distribution of a single sample more Normal: a skewed population will likely produce skewed random samples. The CLT only describes the distribution of averages over repeated samples.
Sample Sizes 1 How large does the sample need to be for the sampling distribution of x to be close to Normal? The answer depends on the population distribution: the farther it is from Normal, the more observations per sample are needed.
Sample Sizes 2 General rule of thumb for sample size n: for skewed populations, a sample of size 25 is generally enough to obtain a Normal sampling distribution; for extremely skewed populations, a sample of size 40 is generally enough.
Sample Sizes 3 Angle of big toe deformations in 28 patients. Population likely close to Normal, so sampling distribution should be Normal.
Sample Sizes 4 Servings of fruit per day for 74 adolescent girls. Population likely skewed, but sampling distribution should be Normal due to large sample size.
CLT and Sampling Distributions There are a few helpful facts that come out of the Central Limit Theorem. These are always true, regardless of population distribution. Means of random samples are less variable than individual observations. Means of random samples are more Normal than individual observations.
Sampling Distributions for Probabilities We have seen that sampling distributions are useful for analyzing the means of quantitative variables. But what if we have a categorical variable instead? Fortunately, we can use the sampling distribution of p̂.
Probability and Categorical Variables Categorical variables can take any of a finite number of possible outcomes. We choose one such outcome and call it a success. All other outcomes are then non-successes or failures. Note: This is an arbitrary choice, not a moral judgment.
Terminology An experiment finds that 6 of 20 birds exposed to an avian flu strain develop flu symptoms. We let the random variable X be the number of birds with flu symptoms. Recall: X is a count of the successes of this categorical variable in a fixed number of observations.
Terminology If the number of observations is labeled as n, then the sample proportion is p̂ = (count of successes in sample)/(size of sample) = X/n. Similar to the sample average x, we can find the sampling distribution for p̂.
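Applied to the bird-flu slide, the definition is a one-line computation (Python sketch):

```python
# Bird flu example from the slides: X = 6 of n = 20 exposed birds develop symptoms.
X, n = 6, 20
p_hat = X / n  # sample proportion of "successes"
print(p_hat)   # 0.3
```

Unlike the raw count X, this proportion can be compared directly across studies with different sample sizes.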
Recall: Binomial Distribution As we saw last week, a binomial distribution describes the count of successes in n observations with a constant probability of success p for each observation. Here we will rely heavily on the fact that the binomial distribution (which is discrete) can be approximated by a Normal distribution.
Recall: Normal Approximation to the Binomial Distribution Suppose a count X has a binomial distribution with n observations and success probability p. When n is large, the distribution of X is approximately Normal with distribution N(np, √(np(1−p))). As a rule of thumb, n should be large enough for the counts of successes and failures to be at least 10 each.
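We can check how good the approximation is by comparing it with an exact binomial calculation. The sketch below (Python standard library; n = 100 and p = 0.3 are illustrative values satisfying the rule of thumb) computes P(X ≤ 25) both ways.

```python
import math
from statistics import NormalDist

n, p = 100, 0.3
assert n * p >= 10 and n * (1 - p) >= 10  # rule of thumb holds

# Exact binomial P(X <= 25), summing the probability mass function.
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(26))

# Normal approximation N(np, sqrt(np(1-p))), with the standard continuity
# correction (evaluate at 25.5 rather than 25, since X is a discrete count).
approx = NormalDist(n * p, math.sqrt(n * p * (1 - p))).cdf(25.5)

print(round(exact, 4), round(approx, 4))
```

The two values agree to about two decimal places, which is typical once the rule of thumb is met.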
Sampling Distribution of a Sample Proportion A count of successes has limited use when comparing different studies (as the sample sizes may differ drastically). The sample proportion p̂ is a much more informative sample statistic. How good is the statistic p̂ as an estimate of the parameter p? Again we ask: what happens with many samples?
The Official Definition Choose an SRS of size n from a large population that has proportion p of successes, and let p̂ be the sample proportion of successes: p̂ = (count of successes in the sample)/n. Then: the mean of the sampling distribution is p; the standard deviation of the sampling distribution is √(p(1−p)/n); as the sample size increases, the sampling distribution of p̂ becomes approximately Normal.
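A simulation confirms both formulas. The sketch below (Python; p = 0.3 and n = 50 are illustrative choices, not from the slides) draws many SRSs and compares the observed mean and s.d. of p̂ with p and √(p(1−p)/n).

```python
import math
import random
import statistics

random.seed(3)
p, n, reps = 0.3, 50, 2000

# Each p_hat is the proportion of successes in one simulated SRS of size n.
p_hats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

print(round(statistics.mean(p_hats), 3))     # near p = 0.3
print(round(statistics.stdev(p_hats), 3))    # near the theoretical sd below
print(round(math.sqrt(p * (1 - p) / n), 3))  # sqrt(p(1-p)/n), about 0.065
```

A histogram of p_hats would also look roughly Normal, matching the third claim in the definition.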
Summary in Picture Form
Warning! Do not use the Normal approximation for the sampling distribution of p̂ when the sample size is small. Also, the population should be much larger than the sample; we'll say at least 20 times larger, as a rule of thumb. The approximation is least accurate when p is close to 0 or 1. (Our sample would contain only successes or only failures unless n is very large.)
Example: Who Gets the Flu? Suppose that we know that 2.5% of all American adults were sick with the flu on a given day of January 2010. The Gallup-Healthways survey interviewed a random sample of 29,483 people and asked whether they were sick with the flu that day. What is the probability that at least 2.3% of such a sample would answer yes in the survey?
Example: Who Gets the Flu? The population proportion is about p = 0.025 and n = 29,483. So the sample proportion p̂ has mean 0.025 and standard deviation √(p(1−p)/n) = √((0.025)(0.975)/29,483) ≈ 0.00091.
Example: Who Gets the Flu? We want the probability that p̂ is 0.023 or greater. First we standardize p̂ and call the corresponding statistic z: z = (p̂ − 0.025)/0.00091. Now finish the calculation: P(p̂ ≥ 0.023) = P((p̂ − 0.025)/0.00091 ≥ (0.023 − 0.025)/0.00091) = P(z ≥ −2.20) = 1 − 0.0139 = 0.9861.
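The whole calculation fits in a few lines (Python sketch using statistics.NormalDist; the numbers are those given on the slides).

```python
import math
from statistics import NormalDist

p, n = 0.025, 29483  # population proportion and sample size

sd = math.sqrt(p * (1 - p) / n)  # sd of the sampling distribution of p_hat
z = (0.023 - p) / sd             # standardize the cutoff 0.023
prob = 1 - NormalDist().cdf(z)   # P(p_hat >= 0.023)

print(round(sd, 5), round(z, 2), round(prob, 4))
```

Doing the standardization in code avoids the rounding that a Normal table introduces, but the answer agrees with the slide's 0.9861 to four decimal places.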
Example: Who Gets the Flu? There is a more than 98% chance that any sample the Gallup-Healthways survey conducts will contain at least 2.3% who say yes.