Study and research skills 2009. Adrian Newton. Last draft 11/24/2008
Inference about the mean: What you will learn. Why we need to draw inferences from samples; the difference between a population and a sample; an intuitive understanding of the properties of a sample; the sampling distribution of the mean; the standard error of the mean; confidence intervals; large sample confidence intervals; small sample confidence intervals.
Inferring the mean Inference Statisticians try to make certain statements about uncertain quantities. We can't know everything about the world. Some properties that we are interested in must be inferred. This is a very difficult concept both to use and to accept. Probability is non-intuitive and often misunderstood.
Inferring the mean Sometimes we just have to do our best! "I am not going to give you a number for it because it's not my business to do intelligent work. What I told him was not necessarily accurate. It might also not have been inaccurate, but I'm disinclined to mislead anyone." Donald Rumsfeld when asked to estimate the number of Iraqi insurgents while testifying before Congress
Inferring the mean Making inferences about a mean If we want to know a mean, why don't we just calculate it? Didn't we see that last time? It is not quite so simple. The problem arises when we want to know the mean of a population. We usually have only a single sample from that population. If we draw a different sample we get a different mean. So although we know the mean of a sample exactly, we can only estimate the mean of a population. It is a known unknown.
Populations What is a population? There is another tricky issue to resolve. What do we mean by a population? This can become a philosophical question. A pragmatic way of looking at it is that a population is anything we really want to know about by drawing a sample. A population could be finite or infinite. For example, the population might be all the pine trees in a 5 ha wood in the New Forest, and we want to estimate the mean diameter of the trees growing at that site. Or... the population might be Pinus sylvestris, and we are interested in the mean needle length for the species. In the first case we could measure every single tree if we had time. In the second case we can only ever get a sample. We can't measure all the members of an infinite population!
An example An example Imagine we have 100 trees in a forest. Their basal areas (area in cross section in cm²) are taken from a theoretical normal distribution with mean = 50 and sd = 20. The population could be either the 100 trees (there are no more) or it could be treated as effectively infinite (notice that the second assumption is not practical in this case). Put that to one side for a moment and look at the trees.
An example One hundred trees [figure: the one hundred trees]
If only... If only we knew the truth! If we could really know that the tree basal areas were taken from a normal distribution with mean = 50 and sd = 20 we wouldn't need to measure any of them. About 68% of the values would lie between 30 and 70. About 95% of the values would lie between 10 and 90. We would know the mean itself. It is 50. No uncertainty at all.
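The 68% and 95% figures can be checked directly; a minimal Python sketch, assuming only that the population really is Normal(50, 20):

```python
from statistics import NormalDist

# The fully-known (theoretical) population: basal areas ~ Normal(mean 50, sd 20)
pop = NormalDist(mu=50, sigma=20)

# Fraction of values within one and two standard deviations of the mean
within_1sd = pop.cdf(70) - pop.cdf(30)
within_2sd = pop.cdf(90) - pop.cdf(10)
print(round(within_1sd, 3))  # 0.683
print(round(within_2sd, 3))  # 0.954
```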
If only... If only we knew the truth! If there are only 100 trees (finite population) we could also become certain about this particular population of trees. We could measure every single tree with extremely precise instruments. We could then calculate the mean and sd for the population of 100 trees.
If only... The theoretical infinite population (red line) and the empirical finite population (histogram) [figure]
If only... The finite population parameters The mean is 48.245 The standard deviation is 20.154 Not far from our theoretical values (50 and 20)
Simulating sampling Sampling It takes about three minutes to measure a tree's diameter accurately, plus walking time between trees. What if we only have one morning available? We might only manage to measure thirty of the hundred trees. What can we then say about the mean of the hundred trees?
Simulating sampling A representative sample It would not be a good idea to measure only the trees on the edge of the forest (they get more light and might be bigger) We should aim for a representative sample This could be obtained by randomly selecting trees
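Random selection is easy to automate; a sketch in Python (the tree IDs and seed are illustrative):

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible
tree_ids = list(range(1, 101))  # hypothetical IDs for the 100 trees

# A simple random sample of 30 trees, drawn without replacement,
# gives every tree the same chance of being selected
sampled = random.sample(tree_ids, 30)
print(len(sampled))       # 30
print(len(set(sampled)))  # 30 -- no tree is measured twice
```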
Simulating sampling A random sample
Simulating sampling A representative sample? [figure: histogram of the sample]
Simulating sampling In fact it is completely representative. We can never expect the histogram of a relatively small sample to look classically normal. We can apply a test of whether the sample could have been drawn from a normal distribution. Most small samples pass such a test, and this one does. Small samples rarely look convincingly normal.
Sample properties The sample properties The mean is 46.546 The standard deviation is 21.193 But what happens if we send someone else back to draw another random sample?
Sample properties Another random sample
Sample properties A different sample s properties The mean is 43.254 The standard deviation is 18.717 And what happens if we send another person back to draw another random sample?
Sample properties Another random sample
Sample properties Another sample s properties The mean is 50.726 The standard deviation is 18.95 And what happens if we send yet another person back to draw another random sample?
Sample properties Another random sample
Sample properties Another sample s properties The mean is 47.578 The standard deviation is 18.394 And so on and so on...
Sample properties Another random sample
Sample properties Another random sample
Sample properties Where is this taking us? We never really would take repeated samples from the same population in this way However frequentist statistical theory is based on this idea. There is something very interesting about the properties of repeated samples Let s get the computer to do this 1000 times and look at the result.
Sample properties The sampling distribution of the mean [figure: histogram of the population basal areas, and histogram of the mean values of 1000 samples of 30 trees]
Sample properties The standard error of the mean The means of our repeated sampling experiments form a much tighter distribution than the data. We can find the mean of the means if we want. It is 48.149. The mean of the finite population of one hundred trees is 48.245. They are close. We can also find the standard deviation of our hypothetical set of sampling experiments. It is 3.053. The standard deviation of the hundred trees is 20.154. They are not close. The standard deviation of the mean is always less than the standard deviation of the data, and it decreases with sample size.
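The repeated-sampling experiment can be sketched in a few lines of Python (the seed and the simulated population are illustrative, so the exact numbers will differ slightly from the slide's):

```python
import random
from statistics import mean, pstdev

random.seed(42)
# A hypothetical finite population: 100 basal areas drawn once from Normal(50, 20)
population = [random.gauss(50, 20) for _ in range(100)]

# Repeat the sampling experiment: 1000 random samples of 30 trees,
# keeping the mean of each sample
sample_means = [mean(random.sample(population, 30)) for _ in range(1000)]

print(round(pstdev(population), 2))    # sd of the raw data (around 20)
print(round(pstdev(sample_means), 2))  # sd of the means -- much smaller
```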
Sample properties 1000 samples of sizes 2, 4, 10, 20, 50 and 90 [figures: histograms of replicate(1000, mean(sample(d$a, n))) for each sample size n; the distribution of the means tightens as n increases]
Sample properties So what use is this? No one would be stupid enough to waste time this way. But the idea is still very useful. It turns out that if we know the standard deviation of a population we know what the standard deviation of the mean will be for any sample size. SD(x̄) = σ/√n, where SD(x̄) represents the true standard deviation of the means, σ is the population standard deviation and n is the sample size.
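The formula SD(x̄) = σ/√n can be verified by simulation; a Python sketch assuming an infinite Normal(50, 20) population:

```python
import math
import random
from statistics import mean, stdev

random.seed(0)
sigma = 20.0  # known population standard deviation

results = {}
for n in (4, 16, 64):
    predicted = sigma / math.sqrt(n)  # SD of the mean from the formula
    # Empirical check: the sd of the means of 2000 simulated samples of size n
    means = [mean(random.gauss(50, sigma) for _ in range(n)) for _ in range(2000)]
    results[n] = (predicted, stdev(means))
    print(n, round(predicted, 2), round(stdev(means), 2))
```

Note that each fourfold increase in n halves the predicted value.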
Standard error The standard error But, that is no use to us! We don't know the population standard deviation unless we measure all the trees! We're back where we started. Fortunately we do have an estimate of it. We can get one every time we take a sample. It is the sample standard deviation. It won't be quite right and it also varies between samples. In the case of small samples it might even be quite hopeless, but if we only measure some of the trees once, it's all we've got. We call the standard deviation of the mean calculated from the sample standard deviation the standard error. SE(x̄) = s/√n
Large samples Inference from large samples If the sample is large (n>30) we might safely assume that our sample standard deviation s is more or less equal to σ We can also assume that our standard error is pretty close to actually being the standard deviation of the mean. Now, this is beginning to look more useful. We already know all about standard deviations from last time.
Large samples 68 percent of observations lie within 1 sd of the mean [figure: standard normal density with the central 68% shaded]
Large samples 95 percent of observations lie within 2 sds of the mean [figure: standard normal density with the central 95% shaded]
Large samples Confidence intervals So, we can imagine what a histogram of the repeated sampling experiments would look like without having to do them. The means would form a normal distribution with standard deviation equal to the standard error. A 95% confidence interval can be calculated as the sample mean plus or minus two standard errors (or 1.96 standard errors if we want to be really fussy): x̄ ± 2·SE(x̄). It will include the true population mean 95% of the time.
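As a reusable sketch in Python (the function name is our own; z = 2 is the slide's rule of thumb):

```python
import math
import random
from statistics import mean, stdev

def large_sample_ci(values, z=2.0):
    """Approximate 95% confidence interval for the mean (large samples, n > 30)."""
    n = len(values)
    se = stdev(values) / math.sqrt(n)  # standard error of the mean
    m = mean(values)
    return m - z * se, m + z * se

# Quick check on simulated data from Normal(50, 20)
random.seed(7)
data = [random.gauss(50, 20) for _ in range(40)]
low, high = large_sample_ci(data)
print(round(low, 2), round(high, 2))  # covers the true mean (50) about 95% of the time
```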
Large samples Calculating a confidence interval: A random sample to try it out on. BA values: 48.60, 39.39, 86.50, 52.60, 21.42, 87.95, 34.47, 78.57, 49.20, 43.95, 22.02, 47.85, 66.35, 26.85, 42.42, 63.20, 34.53, 63.78, 41.83, 58.38, 52.68, 27.73, 46.98, 48.15, 55.09, 28.25, 33.22, 26.43, 74.80, 98.91
Large samples Get the computer to do the work Sample mean = 50.07 Sample standard deviation = 20.39 Standard error = 3.722 95% confidence interval for mean = 50.07 ±7.44 What was the true population mean? 48.25 It falls inside the interval! And it should do so 19 times out of 20.
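The worked example can be reproduced from the thirty values above; a Python sketch:

```python
import math
from statistics import mean, stdev

# The thirty basal areas from the sample slide
ba = [48.60, 39.39, 86.50, 52.60, 21.42, 87.95, 34.47, 78.57, 49.20, 43.95,
      22.02, 47.85, 66.35, 26.85, 42.42, 63.20, 34.53, 63.78, 41.83, 58.38,
      52.68, 27.73, 46.98, 48.15, 55.09, 28.25, 33.22, 26.43, 74.80, 98.91]

m = mean(ba)                         # 50.07
se = stdev(ba) / math.sqrt(len(ba))  # about 3.72
lower, upper = m - 2 * se, m + 2 * se
print(round(m, 2), round(lower, 2), round(upper, 2))
```

The true population mean, 48.25, falls inside the interval (lower, upper).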
Large samples What can make us more confident? The smaller the confidence interval, the more precise our estimate. Remember that if the sample was biased it could still be inaccurate. We can improve precision by measuring something with intrinsically low variability, although in ecology most variability is naturally part of the system, so taking very precise measurements of each element in a sample could be a bit of a waste of time. We can also take a larger sample. But remember that the denominator is the square root of the sample size: decreasing the uncertainty in your estimate of the mean by a factor of two needs four times as many observations, and decreasing it by a factor of ten requires a hundred times as many.
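The square-root penalty is easy to see from the formula; a Python sketch assuming σ = 20:

```python
import math

sigma = 20.0  # assumed population standard deviation

def ci_width(n):
    # A 95% CI spans about 4 standard errors (mean +/- 2 SE)
    return 4 * sigma / math.sqrt(n)

print(ci_width(25))   # 16.0
print(ci_width(100))  # 8.0 -- four times the sample, half the width
```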
Small samples Small sample inference Remember that the sample standard deviation varies between samples. It varies more for small samples, so as sample size decreases the estimate becomes worse. The t statistic is designed to account for this.
Small samples What is the t distribution? A t distribution with a large number of degrees of freedom (large sample) is essentially the same as a normal distribution, so large sample inference using the t distribution is the same as using the z values of a normal distribution (±2 SEs for 95% confidence intervals). However, as the sample size gets smaller the t distribution gets longer tails.
Small samples [figures: t distribution densities, dt(x, df = n − 1), for 30, 20, 10, 5 and 2 degrees of freedom; the tails grow heavier as the degrees of freedom decrease]
Small samples So how do we use this? Look up the 97.5th percentile of the t distribution for the number of degrees of freedom you have (n − 1). Use this value instead of the previous rule-of-thumb number (2) to calculate your confidence intervals.
Small samples Values of the 97.5th percentile of the t distribution for a range of degrees of freedom: df 1: 12.71, df 2: 4.30, df 3: 3.18, df 4: 2.78, df 5: 2.57, df 6: 2.45, df 7: 2.36, df 8: 2.31, df 9: 2.26, df 10: 2.23, df 11: 2.20, df 12: 2.18, df 13: 2.16, df 14: 2.14, df 15: 2.13, df 16: 2.12, df 17: 2.11, df 18: 2.10, df 19: 2.09, df 20: 2.09, df 21: 2.08, df 22: 2.07, df 23: 2.07, df 24: 2.06, df 25: 2.06, df 26: 2.06, df 27: 2.05, df 28: 2.05, df 29: 2.05, df 30: 2.04
Small samples Calculating a confidence interval using the t distribution: A small sample to try it out on. BA values: 46.84, 53.45, 41.83, 42.42, 33.35
Small samples Get the computer to do the work again Sample mean = 43.58 Sample standard deviation = 7.37 Standard error = 3.295 97.5 percentile of t distribution for 4 degrees of freedom = 2.78 95% confidence interval for mean = 43.58 ±9.15 What was the true population mean? 48.25 It falls inside the interval! And it also should do so 19 times out of 20.
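The small-sample calculation can be reproduced from the five values, using the t value 2.78 from the table; a Python sketch:

```python
import math
from statistics import mean, stdev

ba = [46.84, 53.45, 41.83, 42.42, 33.35]  # the five basal areas from the slide

n = len(ba)
m = mean(ba)                   # 43.58
se = stdev(ba) / math.sqrt(n)  # about 3.30
t = 2.78                       # 97.5th percentile of t with n - 1 = 4 df
lower, upper = m - t * se, m + t * se
print(round(m, 2), round(lower, 2), round(upper, 2))
```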
Small samples Assumptions for calculating confidence intervals Both small sample and large sample confidence intervals assume that the data are drawn from a normally distributed population. We will look more carefully at this assumption later Perhaps more importantly the sample must be representative of the population from which it has been drawn.
Small samples What have we covered The fundamental basis of statistics. Understand this class and you've passed the course! We've learnt how to give a defensible range of values for a number we don't know with certainty. We use what we do know to make statements about what we don't. Donald Rumsfeld would approve.
Small samples What you need to remember You must remember all the formulas from class 2. The additional concept is the standard error of the mean. This is the sample standard deviation divided by the square root of the number of observations in the sample. Multiply this by two to get a 95% confidence interval for sample size > 30. Multiply it by a number greater than two (which you get from the t distribution) for sample sizes < 30.