Confidence Intervals 1

Reynard Wheeler

November 1, HMS, 2017, v1.1

Chapter References
Diez: Chapter 4.2
Navidi: Chapters 5.0, 5.1, (self-read: 5.2), 5.3, 5.4, 5.6; not 5.7, 5.8

Terminology
Point estimate: A sample statistic used to estimate the value of a population parameter.
1. Provides a single value: Based on observations from one sample; there is no sampling distribution.
2. Information: Gives no information about how close the value is to the unknown population parameter.
3. Example: The sample mean $\bar{x}$ is the point estimate of the unknown population mean.

Terminology
Confidence interval (interval estimate): A range of values, defined by the confidence level, within which the population parameter is estimated to fall.
1. Provides a range of values.
2. Information: Gives information about closeness to the unknown population parameter.
3. Example: The unknown population mean lies between 50 and 70 with 95% confidence.
Confidence level: The likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter.

Interval Estimation
An interval estimate attaches a probability that the population parameter falls somewhere within the interval:
$\mu_x = \bar{X} \pm \text{error}$

Error Bars
We've all seen error bars:
[Figure: error bars showing one standard deviation from the mean of nine independently evolved populations. Source: "Designing and engineering evolutionary robust genetic circuits", Journal of Biological Engineering 2010, 4:12.]

Error Bars

| Error Bar | Type | Description | Formula |
|---|---|---|---|
| Range | Descriptive | Distance between extremes of data | Highest and lowest points |
| Standard Deviation | Descriptive | Average deviation from the mean | SD |
| Standard Error | Inferential | Variability of the mean | SD/√n |
| Confidence Limit | Inferential | Range of values that you can be 95% confident contains the true mean | To be determined |

Confidence Limits
Recall what the standard error means: the standard error is the standard deviation of the sampling distribution of the mean.
We are 68% confident that the mean is within the limits ±SE.
But we could do better: we could widen the limits to 95%.

Confidence Limits
The standard error of the mean can be interpreted as follows: if we were to take another sample from the population and compute its mean, there is a 68% chance that the mean of the sample will lie within 1 standard error of the population mean.
But 68% is not that big. It is better to use a wider range; the coverage chosen for a given range is referred to as the confidence level.
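The 68% figure can be checked by simulation. The following is a minimal sketch, assuming a normal population; the population mean, standard deviation, sample size, number of trials, and the function name `fraction_within_one_se` are all illustrative choices, not values from the slides.

```python
import math
import random
import statistics


def fraction_within_one_se(mu=10.0, sigma=2.0, n=25, trials=20000, seed=1):
    """Fraction of sample means that fall within one standard error of mu.

    mu, sigma, n, trials and seed are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)  # standard error of the mean
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from a normal population.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if abs(statistics.fmean(sample) - mu) <= se:
            hits += 1
    return hits / trials


fraction = fraction_within_one_se()
print(f"Fraction of sample means within 1 SE: {fraction:.3f}")
```

The printed fraction should land close to 0.68, with small run-to-run variation if the seed is changed.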

Confidence Limits
Confidence level: The likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter.
95% confidence level: there is a 0.95 probability that a specified interval contains the population mean. In other words, there are 5 chances out of 100 (or 1 chance out of 20) that the interval does not contain the population mean. This corresponds to roughly two σ.
99% confidence level: there is 1 chance out of 100 that the interval does not contain the population mean. This corresponds to roughly three σ (more precisely 2.58σ; ±3σ covers 99.7%).

Confidence Limits
How far out is 95% of the area on a normal curve?
This means there is 2.5% of the area in each tail of the normal curve.
Let α = 0.05, i.e. 5%.

Confidence Limits
To find the z-value, we look up 0.975 (the central 95% plus the 2.5% lower tail) in the standard normal cumulative probability table and get 1.96.
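The table lookup can also be done in code. This is a small sketch that uses `statistics.NormalDist` from the Python standard library in place of the printed table (that substitution is an assumption; the slides use a table); it finds the z-value for α = 0.05 and confirms the area it encloses.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

alpha = 0.05
# Inverse cumulative probability at 1 - alpha/2 = 0.975.
z = std_normal.inv_cdf(1 - alpha / 2)
# Area under the curve between -z and +z.
area = std_normal.cdf(z) - std_normal.cdf(-z)

print(f"z = {z:.2f}, central area = {area:.3f}")
```

Changing `alpha` gives the critical value for other confidence levels.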

Confidence Limits
[Table: z value and the corresponding percentage area]

Confidence Limits
In other words, -1.96 to 1.96 on the z scale represents 95% of the area; or (and this is the critical point) 95% is bounded by ±1.96 SE.

Confidence Limits
We can therefore state that, 95% of the time, the sample mean will be bounded by:
$\mu \pm 1.96\,\sigma/\sqrt{n}$
However, we don't actually have the population mean, µ, or the population standard deviation, σ. Instead we use the sample mean and standard deviation as proxies:
$\bar{x} \pm 1.96\,s/\sqrt{n}$
Hence the limits are only approximate. Moreover, the approximation gets worse as the sample size gets smaller.

Confidence Limits
You'll also see the confidence level expressed as:
$z_{\alpha/2} = 1.96$ for 95% confidence
$z_{\alpha/2} = 3$ for 99.7% confidence
Note: The 99.0% level is actually bounded by ±2.58. The z score changes rapidly at the limits of the normal curve.

Exercise
Find the z limits for a 90% confidence level.

Example
The mean birth weight of 200 babies is 3.28 kg, with a population standard deviation of 0.85 kg. Compute the 95% confidence limits for the mean birth weight.
$\bar{x} = 3.28 \pm 1.96 \times 0.85/\sqrt{200} = 3.28 \pm 0.12$ kg

Example
The mean concentration for a sample of 100 insulin vials is 15 grams/vial, with a population standard deviation of 3.4 grams. Compute the 90% confidence limits for the mean concentration of insulin.
$\bar{x} = 15 \pm 1.645 \times 3.4/\sqrt{100} = 15 \pm 0.56$ grams/vial

Summary
Large-sample confidence interval for a population mean.
General formula:
$\bar{x} \pm (z \text{ critical value}) \cdot \sigma/\sqrt{n}$
Levels of confidence and corresponding z critical values:
99%: 2.58
95%: 1.96
90%: 1.645
Since n is large, the unknown σ can be replaced by the sample value s:
$\bar{x} \pm (z \text{ critical value}) \cdot s/\sqrt{n}$
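The general formula can be wrapped in a short helper and applied to the two worked examples above. This is a sketch only: the function name `z_interval` and its parameters are illustrative, and the critical value comes from the standard library's `NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist


def z_interval(mean, sd, n, confidence=0.95):
    """Large-sample confidence limits: mean +/- z * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided z critical value
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Birth weights: n = 200, mean 3.28 kg, sd 0.85 kg, 95% confidence.
birth_low, birth_high = z_interval(3.28, 0.85, 200, confidence=0.95)
# Insulin vials: n = 100, mean 15 g/vial, sd 3.4 g, 90% confidence.
insulin_low, insulin_high = z_interval(15, 3.4, 100, confidence=0.90)

print(f"Birth weight: {birth_low:.2f} to {birth_high:.2f} kg")
print(f"Insulin:      {insulin_low:.2f} to {insulin_high:.2f} grams/vial")
```

The half-widths of the two intervals reproduce the margins of 0.12 kg and 0.56 grams/vial from the examples.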

Problems in Paradise
You may have noticed that the samples in the previous examples were relatively large. This was to ensure that the means and standard deviations were reasonable representatives of the population measures. In fact, the problems stated that the standard deviation was the population standard deviation.
For small samples (n < 30) we have to make a slight modification to the procedure.

Student's t Distribution
For large samples we assume that the means are distributed as:
$\bar{X} \sim N(\mu, \sigma^2/n)$
and that the standardized quantity $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a normal distribution with mean 0 and variance 1.
However, when the sample size is small, there is significant error in the estimate of the population standard deviation, σ, because we usually estimate it from the sample standard deviation, s.
In 1908 Gosset proposed that the quantity $(\bar{X} - \mu)/(s/\sqrt{n})$ in fact follows a different distribution, which became known as Student's t distribution.

Student's t Distribution
Definition: Let $X_1, \ldots, X_n$ be a small sample ($n < 30$) from a normal population with mean µ. Then the quantity
$\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$
has a Student's t distribution with $n - 1$ degrees of freedom, denoted $t_{n-1}$.
Note: The t distribution is a function of the sample size minus 1.

Confidence Limits
For small sample sizes, use the t distribution instead of the standard normal distribution.
The t distribution is a symmetrical distribution whose probability density function is defined by a single parameter known as the degrees of freedom (df).
For example, if the sample size is 19, then the degrees of freedom will be 18.

Confidence Limits
A larger portion of the probability area is in the tails compared to the standard normal distribution. This in turn means the confidence limits computed using the t distribution will be wider.
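The heavier tails show up directly in the critical values. The sketch below, which assumes SciPy is available (the slides already use `scipy.stats`), compares the two-sided 95% t critical value with the normal value of 1.96 for a few illustrative degrees of freedom.

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal distribution.
z_crit = stats.norm.ppf(0.975)

# Same quantile of the t distribution for a few (illustrative) degrees of freedom.
t_crits = {df: stats.t.ppf(0.975, df) for df in (2, 5, 18, 30, 100)}

print(f"normal: {z_crit:.3f}")
for df, t_crit in t_crits.items():
    print(f"t, df = {df:3d}: {t_crit:.3f}")
```

Every t critical value exceeds the normal one, and the gap shrinks as the degrees of freedom grow, which is why the distinction matters mainly for small samples.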

Confidence Limits for Small Samples
The confidence limits for a small sample are given by:
$\bar{X} \pm t_{n-1,\alpha/2}\, \dfrac{s}{\sqrt{n}}$
where n is the sample size, n - 1 the degrees of freedom, and α/2 the area in each tail (e.g. α = 0.05 gives α/2 = 0.025, for 95% confidence).
Just as there are z tables, there are also t tables.

t Tables
[Figure: t table]

t Tables: One-Tailed and Two-Tailed Tables
[Figure: one-tailed and two-tailed t tables]

t Tables: How to Use the t Table
[Figure: how to use the t table]

t Tables: Example
Six vials of penicillin were randomly selected and the concentration of penicillin in each vial was determined, in mg/ml, to be:
8.6, 9.7, 13.4, 11.4, 10.2, 12.3
Find the 95% confidence limits for the true mean concentration of penicillin.

t Tables: Example
Since the sample size is less than 30, we will use the t-statistic to determine the confidence limits.
n = 6
$\bar{x} = 10.93$
Sample standard deviation s = 1.77
$\bar{X}$ is the random variable that represents the mean.
$\bar{X} \pm t_{n-1,\alpha/2}\, \dfrac{s}{\sqrt{n}}$

t Tables: Example
n - 1 = 6 - 1 = 5
α/2 = 0.05/2 = 0.025
Note: α is the area outside the critical bounds = 0.05.
In the t table we look for row 5 and the column marked 0.025. Note that 0.025 is the area of a single tail, therefore we use the single-tailed table.

t Tables: Example
Therefore:
$t_{5,\,\alpha/2 = 0.025} = 2.571$
$\mu = 10.93 \pm 2.571 \times 1.77/\sqrt{6} = 10.93 \pm 1.86$
For the z-statistic the critical value would be 1.96, so the range has widened with the t-statistic.
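The penicillin example can be reproduced in a few lines. This is a sketch that assumes SciPy is available and uses `scipy.stats.t.ppf` in place of the printed t table; the variable names are illustrative.

```python
import math
import statistics

from scipy import stats

concentrations = [8.6, 9.7, 13.4, 11.4, 10.2, 12.3]  # mg/ml, from the example

n = len(concentrations)
mean = statistics.mean(concentrations)
s = statistics.stdev(concentrations)  # sample standard deviation (n - 1 denominator)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)  # single-tail area 0.025, 5 degrees of freedom
margin = t_crit * s / math.sqrt(n)

print(f"mean = {mean:.2f}, s = {s:.2f}, t = {t_crit:.3f}")
print(f"95% limits: {mean:.2f} +/- {margin:.2f} mg/ml")
```

The computed mean, standard deviation, critical value, and margin agree with the hand calculation to the precision quoted in the slides.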

Confidence Intervals for Difference between two Means
Consider two samples from two different populations. What is the confidence limit for the difference in the two means, i.e. $\mu_X - \mu_Y$?
From previous lectures on combining means and variances we know that:
$X - Y \sim N(\mu_X - \mu_Y,\ \sigma_X^2 + \sigma_Y^2)$
That is, the difference is also normally distributed, but with a different mean and variance.

Confidence Intervals for Difference between two Means
Consider two sample means, $\bar{X}$ and $\bar{Y}$, that have standard errors
$\sigma_X/\sqrt{n_1}$ and $\sigma_Y/\sqrt{n_2}$
where $n_1$ and $n_2$ are the sizes of the corresponding samples. Then the difference $\bar{X} - \bar{Y}$ has:
Mean $= \mu_X - \mu_Y$ and variance $= \dfrac{\sigma_X^2}{n_1} + \dfrac{\sigma_Y^2}{n_2}$
Given the new mean and variance we can compute the 95% confidence limits as:
$\mu_X - \mu_Y \pm 1.96\sqrt{\dfrac{\sigma_X^2}{n_1} + \dfrac{\sigma_Y^2}{n_2}}$
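A minimal sketch of the large-sample calculation follows. The function name `diff_interval` and all the input numbers are hypothetical, chosen only to illustrate the formula; as with a single mean, the observed sample means stand in for the unknown population means.

```python
import math
from statistics import NormalDist


def diff_interval(mean_x, sd_x, n_x, mean_y, sd_y, n_y, confidence=0.95):
    """Large-sample confidence limits for the difference of two means."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Variances of the two sample means add.
    se = math.sqrt(sd_x**2 / n_x + sd_y**2 / n_y)
    diff = mean_x - mean_y
    return diff - z * se, diff + z * se


# Hypothetical data: two groups of birth weights (kg).
low, high = diff_interval(3.28, 0.85, 200, 3.10, 0.80, 150)
print(f"95% limits for the difference: {low:.3f} to {high:.3f} kg")
```

If the resulting interval excludes zero, the data are consistent with a real difference between the two population means at the chosen confidence level.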

Confidence Intervals for Difference between two Means
If the samples are small, then the confidence limits for the difference in two means require the use of the t distribution, as before.
The main complication is calculating the degrees of freedom, which is difficult to do. I refer you to section 5.6 in Navidi for details.
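One widely used approximation for the degrees of freedom is the Welch-Satterthwaite formula. Whether this is exactly the treatment in the referenced section is an assumption here, so treat the following as a sketch rather than the course's method. The function name `welch_interval` is illustrative; the first sample uses the summary statistics of the penicillin example and the second sample is hypothetical.

```python
import math

from scipy import stats


def welch_interval(mean_x, s_x, n_x, mean_y, s_y, n_y, confidence=0.95):
    """Small-sample limits for a difference of means (Welch-Satterthwaite df)."""
    vx = s_x**2 / n_x  # estimated variance of the first sample mean
    vy = s_y**2 / n_y  # estimated variance of the second sample mean
    df = (vx + vy) ** 2 / (vx**2 / (n_x - 1) + vy**2 / (n_y - 1))
    df = math.floor(df)  # rounding down gives a slightly wider interval
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
    margin = t_crit * math.sqrt(vx + vy)
    diff = mean_x - mean_y
    return diff - margin, diff + margin, df


# First sample: penicillin example. Second sample: hypothetical values.
low, high, df = welch_interval(10.93, 1.77, 6, 9.5, 1.2, 8)
print(f"df = {df}, 95% limits: {low:.2f} to {high:.2f}")
```

The approximate degrees of freedom always fall between the smaller of the two values n - 1 and the pooled value n1 + n2 - 2.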

Using Simulation to Estimate Confidence Limits
All the methods so far assume the sample is obtained from a population that is normal or near normal.
What happens if the population is not normal?

Using Simulation to Estimate Confidence Limits
Consider a series of gene expression rates measured from seven cultures of E. coli. The data are as follows:
7.69, 4.97, 4.56, 6.49, 4.34, 6.24, 4.45
A normal Q-Q (probability) plot was made using the following code:

    import matplotlib.pyplot as plt
    import scipy.stats as stats

    measurements = [7.69, 4.97, 4.56, 6.49, 4.34, 6.24, 4.45]

    # Plot the ordered measurements against normal quantiles.
    stats.probplot(measurements, dist="norm", plot=plt)
    plt.show()

Using Simulation to Estimate Confidence Limits
The data do not appear to be normally distributed.

Using Simulation to Estimate Confidence Limits
To find 95% confidence intervals for these data we create synthetic data sets using a bootstrap.
We create a new sample by drawing values at random (and with replacement) from the measured sample. For example, the following could be a new synthetic sample:
6.49, 4.97, 4.34, 6.24, 7.69, 6.49, 4.34
Because of replacement, the same value can be picked multiple times.
We do this 100,000 times to create 100,000 synthetic data sets.
We compute the mean of each synthetic data set to generate a population of means.
We can use this population to work out the distribution, and hence the confidence limits on the mean.

Using Simulation to Estimate Confidence Limits
1. Sort the sample means from low to high.
2. Find the 2.5% and 97.5% percentiles. The interval between them contains 95% of the bootstrap means.
These calculations can easily be done in Python.

Using Simulation to Estimate Confidence Limits

    import random
    import numpy as np

    measurements = [7.69, 4.97, 4.56, 6.49, 4.34, 6.24, 4.45]

    # Build 100,000 bootstrap samples and record the mean of each one.
    ensemble = []
    for i in range(100000):
        # Draw 7 values at random, with replacement, from the measurements.
        sample = [random.choice(measurements) for j in range(len(measurements))]
        ensemble.append(np.mean(sample))

    # np.percentile orders the data internally, so no explicit sort is needed.
    p1 = np.percentile(ensemble, 2.5)
    p2 = np.percentile(ensemble, 97.5)
    print(p1, p2)

Using Simulation to Estimate Confidence Limits
Running the Python script gives the 95% confidence limits on the gene expression rate: 4.72 to 6.46.
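The same bootstrap can be written without an explicit loop. The sketch below is an alternative formulation, not the script from the slides: it assumes NumPy's `Generator` API, and the seed and the smaller number of resamples (20,000) are illustrative choices made so that the run is quick and repeatable.

```python
import numpy as np

measurements = np.array([7.69, 4.97, 4.56, 6.49, 4.34, 6.24, 4.45])

rng = np.random.default_rng(seed=0)  # fixed seed (an assumption) for repeatability
n_resamples = 20000                  # fewer than the 100,000 used in the slides

# Each row is one bootstrap sample drawn with replacement.
resamples = rng.choice(measurements, size=(n_resamples, measurements.size), replace=True)
boot_means = resamples.mean(axis=1)

# Percentile limits of the bootstrap means.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap limits: {low:.2f} to {high:.2f}")
```

The limits vary slightly from run to run with a different seed, but should stay close to the 4.72 to 6.46 reported above.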
