Sampling distributions and estimation

1) A brief review of distributions:

We're interested in Pr{three sixes when throwing a single die 8 times}. => Y has a binomial distribution, or in official notation, Y ~ BIN(n,p). (The symbol => means "implies".)

Other binomial examples: Pr{3 mutants in a sample of 5}; Pr{8 people with blood type O in a sample of 20}. In all cases Y = number of successes, and Y ~ BIN.

We're interested in Pr{Y > 120} where Y is the IQ of a randomly chosen person. It turns out that IQ is normally distributed, and so Y ~ N(μ,σ).

Other normal examples: Pr{Y > y} where y is a specific blood oxygen level, assuming blood oxygen level is normally distributed (actually, it probably isn't). Height is a classic example of a normally distributed random variable, so if Y = height => Y ~ N(μ,σ).

Of course, there are many other distributions: Poisson, uniform, etc. If, for example, we have something we know has a Poisson distribution (e.g., the number of ticks on a dog), we could write Y ~ Poi(μ). (The Poisson distribution has only one parameter, μ.)

2) But this is all review; what's the point here?

Often, in statistics, we're not interested in the probability of an individual number (the examples above). Instead, we are interested in the distribution of the parameter estimates (or sometimes of functions of the parameter estimates). This sounds confusing, but here's an example: what is the distribution of the sample mean, ȳ? (We can also ask about the distribution of s² or p̂, but we won't do much with the distributions of these in this course.)
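As a quick check on the binomial review above, the die example can be worked out directly from the binomial formula, Pr{Y = y} = C(n,y) pʸ (1−p)ⁿ⁻ʸ. The notes don't use Python, so this is just an illustrative sketch:

```python
# Pr{exactly three sixes in 8 throws of a fair die}, straight from the
# binomial formula: C(n,y) * p**y * (1-p)**(n-y).
from math import comb

n, y, p = 8, 3, 1 / 6
prob = comb(n, y) * p**y * (1 - p) ** (n - y)
print(round(prob, 4))  # about 0.104
```

So about a 10% chance — a handy sanity check that BIN(8, 1/6) is the right model here.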
Why is this interesting? Because if we know the distribution of ȳ, we can ask questions like: What is the probability that the average height of men is less than 68 inches? Or better: What is the probability that the average blood pressure with medication is lower than the average blood pressure with a placebo? We can start asking (and answering!) questions that are important in biology, medicine, and elsewhere!

So, to summarize, a sampling distribution is the distribution of one of our parameter estimates, such as ȳ, s², p̂, etc.

3) Another way of looking at it (similar to the book):

Let's do an experiment many times (for example, we take a sample many times). Each time we calculate the sample mean, so we'd have ȳ₁, ȳ₂, ȳ₃, ..., ȳₘ. Then we ask: What is the distribution of these m sample means? But we also want to know: What is the population mean of all these sample means? What are the population variance and population standard deviation of these sample means? (And, for that matter, how do we estimate these using ȳ and s²?)

As it turns out, we don't need to do m experiments for this to work. Instead we can use some mathematics and figure it out. The math is similar to that used for the mathematical definition of a population mean (in other words, we use expected values to calculate this). We won't go into the details here.

4) So we need to know the answers to the above questions. Without doing the math, here they are:

The (theoretical or population) mean for ȳ is given as follows: μ_ȳ = μ

The (theoretical or population) standard deviation for ȳ is given by: σ_ȳ = σ/√n
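Even though we don't need to repeat the experiment m times, a computer can do it for us, and that makes a nice check on both results. Here's a Python simulation sketch (the choices μ = 64, σ = 3, n = 25, and m = 20000 repeats are mine, just for illustration):

```python
# Simulate m repeated samples of size n from N(mu, sigma), compute the mean
# of each sample, and check: mean of the sample means ~ mu, and their
# standard deviation ~ sigma / sqrt(n).
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

mu, sigma, n, m = 64, 3, 25, 20000

sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(m)
]

print(statistics.mean(sample_means))   # close to mu = 64
print(statistics.stdev(sample_means))  # close to sigma/sqrt(n) = 3/5 = 0.6
```

The simulated values land very close to μ and σ/√n, which is exactly what the two formulas above claim.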
Are you surprised by the first result? Hopefully not. It makes a certain intuitive sense that if ȳ estimates μ, then the average of a bunch of ȳ's also estimates μ. The second result is more difficult to understand (it's not quite intuitive). But think about the following: if you use a bunch of ȳ's, you would expect them to be less spread out than the original data. In other words, you would expect the variance (or standard deviation) for ȳ to be less than the variance (or standard deviation) for Y. Dividing by √n (or dividing the variance by n) lets us do that. (The ȳ's are closer to each other than the y's.) And remember that we could do some math and show that this is actually the case. Here's a graphical representation of what's going on:
All of this now leads us directly into the next question: what about the actual distribution of Ȳ? If Y is normal, Ȳ is normal (again, hopefully no surprise). If Y is not normal, then if n is large enough, Ȳ is still approximately normal! This answer is due to the Central Limit Theorem (CLT). This is one of the reasons that the normal distribution is so important in statistics. (There are some theoretical exceptions to this theorem, but in practice they're usually not that important.)

In other words: take almost any distribution at all. Take the average, ȳ. The average ȳ will be normally distributed if n is large enough. (What is a large enough n? We'll answer that question later when we start looking at the t-distribution.)

So, to summarize, what does all this mean? That we can use our normal tables to calculate probabilities associated with Ȳ. For example, we can answer questions like: Pr{Ȳ < 68 inches}, or Pr{Ȳ₁ < Ȳ₂} where Ȳ₁ = average blood pressure on medication and Ȳ₂ = average blood pressure on placebo. (In practice, it turns out that we'll usually wind up using t-tables (i.e., the t-distribution) since our n will often not be large enough, but more on that next time.)

Questions like these are much more interesting than the ones we've been asking until now. This will lead us directly into confidence intervals and then (finally!) statistical hypothesis tests. These concepts are really important, so if you didn't follow everything in class, you might want to reread the notes and ask questions.
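The CLT is easy to see in a simulation. The sketch below starts from an exponential distribution (my choice; any strongly skewed distribution would do) and checks that averages of n = 50 observations are far less skewed, i.e., much closer to a symmetric normal shape:

```python
# CLT sketch: the exponential distribution has skewness 2 (very non-normal),
# but means of n = 50 exponential observations have skewness near 0.
import random
import statistics

random.seed(2)

def skewness(xs):
    """Crude sample skewness: third central moment over sd cubed."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return sum((x - mean) ** 3 for x in xs) / (len(xs) * sd**3)

raw = [random.expovariate(1.0) for _ in range(10000)]

means = [statistics.mean(random.expovariate(1.0) for _ in range(50))
         for _ in range(10000)]

print(skewness(raw))    # near 2: heavily right-skewed
print(skewness(means))  # near 0: approximately normal
```

This is only a rough check (skewness alone doesn't prove normality), but it shows the averaging effect the CLT describes.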
5) Enough theory. Let's see how to use sampling distributions.

Let's look at human heights (we're trying to do something intuitive). If we look at the height of women, the average height is about 64 inches, and the standard deviation is about 3 inches. If we assume that these represent μ and σ (we don't really know these), then we have:

μ = 64 inches and σ = 3 inches

We know how to calculate simple probabilities, so let's calculate the probability that a single woman is more than 6 feet (= 72 inches) tall. Since height is normally distributed, we can simply do:

Pr{Y > 72} = Pr{Z > 2.67} = 0.0038

We have a small but real probability. Now what happens if we randomly choose 5 women and want to know the probability that their average height is more than 6 feet? We want:

Pr{Ȳ > 72}

We can calculate z, but this time we need to be a little careful in our denominator. We need to do:

z = (ȳ − μ_ȳ)/σ_ȳ = (ȳ − μ)/(σ/√n)

In other words, we need to use the μ and σ for ȳ. What is the (population) mean of Ȳ for our five women? What is the population standard deviation?

μ_ȳ = μ = 64 inches

σ_ȳ = σ/√n = 3/√5 = 1.342 inches

So now we can calculate our probability as follows:

z = (ȳ − μ_ȳ)/σ_ȳ = (72 − 64)/1.342 = 5.96

so we have Pr{Ȳ > 72} = Pr{Z > 5.96} = 1.26119e-09

Notice the absurdly small value for the probability (we had to use R to get it). This is because it is extremely (extremely!) unlikely that any sample of 5 women will have an average height of more than 6 feet.
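The notes used R for these two calculations; here is the same arithmetic as a Python sketch using the standard library's NormalDist (the answers agree with the ones above up to the rounding of z):

```python
# Redo the two height calculations: one woman vs. the average of five.
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 64, 3, 5

# Pr{Y > 72}: a single woman taller than 6 feet.
p_one = 1 - NormalDist(mu, sigma).cdf(72)

# Pr{Ybar > 72}: the average of 5 women, whose sd is sigma/sqrt(n).
p_mean = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(72)

print(p_one)   # about 0.0038
print(p_mean)  # about 1.2e-09
```

Note how the same cutoff (72 inches) goes from unlikely for one woman to essentially impossible for an average of five: shrinking the standard deviation from σ to σ/√n pulls the sampling distribution in tightly around μ.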
6) Standard Error (SE)

You should realize that sample size can change probabilities drastically compared to just taking a single individual (n = 1). We won't explain all of this right away, but let's start to worry about the fact that we almost never know μ and σ. So what do we do instead? How can we answer probability questions if we don't know μ and σ? We won't learn the answer to that yet, but it should be obvious that somehow we need to use our estimates (ȳ and s²) instead of μ and σ. We estimate μ with ȳ, and σ with s. We've discussed this before. The fact that we have to estimate σ with s will cause us some problems later, but let's not worry about it yet. (The fact that we don't know μ or σ will change our sampling distribution.)

So what happens to the standard deviation for Ȳ? Instead of using σ we use s. In other words, we estimate:

σ/√n with s/√n

This quantity is also known as the Standard Error (in this case, for ȳ):

SE_ȳ = s/√n

So what does the Standard Error tell us? Basically, how reliable our ȳ is as an estimate of the mean. If the SE is small, our ȳ is a pretty good estimate; if it's big, our ȳ is not so good an estimate. (More on this next time.)

Important: the Standard Error is not always the same; it depends on what you're interested in. If you're interested in ȳ, then it's defined as above. You can also have a standard error for proportions, for differences in means, and for many other quantities. We'll encounter some of these later.
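Computing SE_ȳ from data is just two steps: get s, then divide by √n. A small sketch (the eight heights below are made-up numbers, not from the notes):

```python
# SE for the sample mean: s / sqrt(n), with s standing in for the unknown sigma.
from math import sqrt
from statistics import stdev

heights = [63.1, 66.4, 61.8, 64.9, 67.2, 62.5, 65.0, 63.7]  # hypothetical data
n = len(heights)
s = stdev(heights)  # sample standard deviation (divides by n - 1)

se = s / sqrt(n)
print(round(s, 3), round(se, 3))
```

Notice that the SE is always smaller than s itself, and gets smaller still as n grows — which previews the table in the next section.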
Let's finish this set of notes by looking at the effect of sample size on some of the quantities we've discussed. We'll use the example for the heights of women above. Using R, here are some simulated data for various sample sizes, assuming μ = 64 inches and σ = 3 inches:

estimate   n = 10    n = 100   n = 1000   n → ∞
ȳ          64.51     64.16     63.97      ȳ → μ
s          2.60      2.91      2.96       s → σ
SE         0.82      0.29      0.09       SE → 0

As the sample size approaches infinity, we know more and more about the population. If we had perfect knowledge (if we measured all women), then we really would know the values of μ and σ. Similarly, since the SE measures the uncertainty in our ȳ, it goes to 0 (if we know μ, then ȳ = μ, and we no longer have any uncertainty). Remember - Ȳ is random, μ is an (unknown) constant.
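You can reproduce the spirit of this table yourself. The sketch below draws one sample at each size and reports ȳ, s, and SE; the exact numbers will differ from the table above (and from run to run without the seed), since they depend on the random draws:

```python
# One simulated sample at each size from N(64, 3): watch ybar approach mu,
# s approach sigma, and SE = s/sqrt(n) shrink toward 0.
from math import sqrt
import random
import statistics

random.seed(3)  # arbitrary seed for reproducibility
mu, sigma = 64, 3

for n in (10, 100, 1000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    ybar = statistics.mean(sample)
    s = statistics.stdev(sample)
    print(n, round(ybar, 2), round(s, 2), round(s / sqrt(n), 2))
```

The pattern matches the table: ȳ and s wobble around 64 and 3 at every n, while the SE drops steadily as n grows.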
Remember: μ and σ are not random variables. They don't change, and are obviously not affected by sample size! ȳ and s² are random variables. They can change depending on what's in your sample, but usually don't change drastically with sample size. SE is also random. SE is affected by sample size (it gets smaller, which is easy to see from the equation above). You should make sure you thoroughly understand this example!