Introduction to Statistical Data Analysis Lecture 4: Sampling

James V. Lambers
Department of Mathematics
The University of Southern Mississippi

Introduction

To complete the transition from descriptive statistics to inferential statistics, we need to know how to work with a sample of a population, since in many cases gathering descriptive statistics from the entire population is impractical. Therefore, in this lecture, we discuss sampling techniques.

Simple Sampling, Systematic Sampling, Cluster Sampling, Stratified Sampling

Once it has been determined that only a sample of a population of interest can be studied, how to obtain that sample is far from a trivial matter. It is essential that the sample not be biased; that is, the sample must be representative of the entire population, or any inferences made from the sample will not be reliable. To reduce the chance of bias, it is best to use random sampling, in which members are selected by chance and every member of the population has a chance of being selected. We now discuss various approaches to random sampling.

Simple Sampling

In simple sampling, each member of the population has an equal chance of selection. Typically, tables of random numbers are used to assist in such a selection process. For example, suppose all members of the population can be numbered. Then, the table of random numbers can be used to determine the numbers of the members of the population who are to be included in the sample.
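As a minimal sketch of the numbering idea above, R's built-in sample function can play the role of the table of random numbers. The population size N and sample size n below are illustrative values, not taken from the lecture.

```r
# Hypothetical population whose members are numbered 1..N
# (N and n are illustrative values, not from the lecture).
set.seed(1)   # fixed seed only so the sketch is reproducible
N <- 500      # population size
n <- 25       # desired sample size

# Draw n distinct member numbers; each member is equally likely to be chosen.
chosen <- sample(1:N, n)
print(sort(chosen))
```

By default sample draws without replacement, so no member is selected twice.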

Systematic Sampling

Simple sampling is susceptible to bias if some aid, such as a table of random numbers, cannot be used. To avoid this bias, one can use systematic sampling, which consists of selecting every $k$th member of the population. If the population has $N$ members and a sample of size $n$ is desired, then one should choose $k \approx N/n$.
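The rule of selecting every kth member might be sketched in R as follows. The values of N and n are illustrative, and the random starting point is an assumption of this sketch; the lecture does not say where the selection should begin.

```r
# Illustrative values (not from the lecture).
set.seed(2)
N <- 500
n <- 25

k <- floor(N / n)          # sampling interval, k is approximately N/n
start <- sample(1:k, 1)    # ASSUMPTION: random start within the first interval

# Select every k-th member, beginning at 'start'.
chosen <- seq(from = start, by = k, length.out = n)
print(chosen)
```

Taking the floor of N/n keeps the last selected number within the population.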

Cluster Sampling

In cluster sampling, the population is divided into groups, called clusters, and then random sampling is applied to the clusters. That is, entire clusters are chosen to obtain the sample. This is effective if each cluster is representative of the entire population.
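A small sketch of choosing entire clusters, using a made-up population of 12 clusters with 10 members each. The data frame and its column names are hypothetical.

```r
set.seed(3)

# Hypothetical population: 12 clusters of 10 members each.
population <- data.frame(
  id      = 1:120,
  cluster = rep(1:12, each = 10)
)

# Randomly choose 3 whole clusters, then keep every member of those clusters.
chosen_clusters <- sample(unique(population$cluster), 3)
cluster_sample  <- population[population$cluster %in% chosen_clusters, ]

print(table(cluster_sample$cluster))
```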

Stratified Sampling

In stratified sampling, the population is divided into mutually exclusive groups, called strata, and then random sampling is performed within each stratum. This approach can be used to ensure that each stratum is treated equally within the sample. For example, suppose that for a national poll, it was desired to have a sample in which each state was represented equally. Then, the strata would be the states, and a sample could be obtained from the population of each state.
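Sampling within each stratum could look like the sketch below. The three strata, their sizes, and the per-stratum sample size are hypothetical; equal allocation is used to mirror the equal-representation example above.

```r
set.seed(4)

# Hypothetical population with three strata of unequal size.
population <- data.frame(
  id      = 1:600,
  stratum = rep(c("A", "B", "C"), times = c(300, 200, 100))
)
per_stratum <- 20   # same number drawn from every stratum

# Split the ids by stratum, then draw a simple random sample within each one.
ids_by_stratum <- split(population$id, population$stratum)
sampled_ids <- unlist(lapply(ids_by_stratum,
                             function(ids) sample(ids, per_stratum)))
stratified_sample <- population[population$id %in% sampled_ids, ]

print(table(stratified_sample$stratum))
```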

Sampling Errors, Poor Sampling Technique

Sampling must be performed with care, so that any inferences made about the population from the sample have at least some validity.

Sampling Errors

A descriptive statistic computed from a sample is only an estimate of the corresponding value for the population, which, in most cases, cannot be obtained. However, it is possible to estimate the error in the sample statistic, called the sampling error; we will learn how to do so later, using confidence intervals. As we will see then, choosing a larger sample reduces the sampling error. It can be made arbitrarily small by choosing a sample close to the size of the entire population, but usually this is not practical.

Poor Sampling Technique

Even if a very large sample is chosen, conclusions drawn from the sample do not apply to the population if the sample is biased. On the other hand, if a sample is truly representative of the population, then it does not need to be large to be reliable. It is also important to avoid making unrealistic assumptions about the sample.

1948 Presidential Election

In a poll conducted during the 1948 presidential election, voters in the sample were classified as supporting Harry Truman, supporting Thomas Dewey, or undecided. The polling organization assumed that undecided voters would split between the two candidates in the same proportions as the decided voters, which led to the conclusion that Dewey would win. However, the undecided voters actually favored Truman more strongly, which led to his victory.

Sampling Distribution of the Mean

Suppose that it is desired to measure some quantifiable characteristic of a population, such as average height, or the percentage of the population that votes Republican. A sample of the population can be taken, and then the characteristic of the sample, whatever it is, can be computed from information obtained from each member of the sample. Now, suppose that many samples are taken, with each sample being the same size. Then, the values that are computed from these samples form a set of outcomes, where the experiment in question is the computation of the desired characteristic of the sample. This set of outcomes obtained from samples is called a sampling distribution.

Sampling Distribution of the Mean

Sampling distributions apply to a number of different statistics, but the most commonly used is the mean. The sampling distribution of the mean is the pattern of means that is obtained from computing the sample means from all possible samples of the population.

Example

We will illustrate the sampling distribution of the mean using the example of rolling a six-sided die. Each of the six numbers has an equal likelihood of appearing face up, so these values follow a discrete uniform probability distribution, which is a distribution that assigns the same probability to each possible outcome.

Example, cont'd

The mean of such a distribution is

$$\mu = \frac{a + b}{2},$$

where $a$ and $b$ are the minimum and maximum values, respectively, of the distribution. The variance is given by

$$\sigma^2 = \frac{1}{12}\left[(b - a + 1)^2 - 1\right].$$

Therefore, for the case of a six-sided die, for which $a = 1$ and $b = 6$, we have $\mu = 3.5$ and $\sigma^2 = 35/12$.
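As a quick check of these two formulas for the die, the following sketch compares them with the mean and population variance computed directly from the six faces. The variable names are illustrative.

```r
a <- 1
b <- 6
faces <- a:b

# Values from the closed-form expressions for a discrete uniform distribution.
mu_formula  <- (a + b) / 2
var_formula <- ((b - a + 1)^2 - 1) / 12

# Values computed directly from the six equally likely outcomes.
mu_direct  <- mean(faces)
var_direct <- mean((faces - mu_direct)^2)   # population variance: divide by 6, not 5

print(c(mu_formula = mu_formula, mu_direct = mu_direct))
print(c(var_formula = var_formula, var_direct = var_direct))
```

Note that R's var function divides by one less than the number of values, so it is not used here.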

Example, cont'd

Now, suppose we roll the die $n$ times, where $n$ is the size of our sample, and compute the sample mean $\bar{x}$. Then, we repeat this process $m$ times, gathering $m$ samples, each of size $n$. The $m$ sample means form a sampling distribution of the mean, which we can then display in a histogram.

Displaying the Sample Means

This is accomplished in R using the following statements (assuming the values of n, the sample size, and m, the number of samples, are already defined):

> means=c()
> for (i in 1:m) means[i]=mean(round(runif(n,0.5,6.5)))
> hist(means,seq(1,6,0.5))

Code Dissection

The first statement, means=c(), creates an empty vector called means, which will hold the sample means. The second statement, for (i in 1:m) means[i]=mean(round(runif(n,0.5,6.5))), executes a loop m times, in which the ith element of the means vector is set to the mean of a vector of n numbers generated by runif from the uniform probability distribution with a = 0.5 and b = 6.5, and then rounded to the nearest integer by round. Because each integer from 1 to 6 is the nearest integer for an interval of length 1, this yields a sample in which the numbers 1 through 6 are equally likely.

Code Dissection, cont'd

The third statement, hist(means,seq(1,6,0.5)), generates a histogram of the frequency distribution of the sample means, with classes chosen to have width 0.5. Recall that the expression seq(a,b,h) generates a sequence of numbers starting at a and ending at (or just before) b, with spacing h. If the terms of the sequence have a spacing of 1, then the shorthand a:b can be used instead; note that this is used in the for statement.
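For readers who want to run the experiment from scratch, here is a self-contained variant. It is a sketch rather than the lecture's own code: it defines n and m itself, and it swaps in sample(1:6, n, replace = TRUE) to roll the die instead of rounding runif values. It also passes plot = FALSE so that the class counts are printed rather than drawn.

```r
set.seed(42)   # fixed seed only so the sketch is reproducible
n <- 2         # sample size (illustrative)
m <- 50        # number of samples (illustrative)

means <- numeric(m)                         # preallocate the vector of sample means
for (i in 1:m) {
  rolls <- sample(1:6, n, replace = TRUE)   # n rolls of a fair die
  means[i] <- mean(rolls)
}

# Same classes of width 0.5 as in the lecture's hist call.
h <- hist(means, breaks = seq(1, 6, 0.5), plot = FALSE)
print(h$counts)
```

Removing plot = FALSE displays the histogram itself.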

Sampling Distribution of the Mean, n = 2

Suppose we use a small sample of size $n = 2$, and compute $m = 50$ samples. The means are spread widely across the interval from 1 to 6.

Increasing the Sample Size

Now, suppose that we increase $n$ (keeping $m$ fixed) and see what happens to the distribution. We see that the distribution of the sample means approaches the shape of a normal distribution, with its mean roughly that of the original uniform distribution.
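The effect of increasing n can also be seen numerically. The sketch below is an illustration under assumed values: the sample sizes 2, 10, and 50 are arbitrary, m is set to 2000 (much larger than the 50 used above) to steady the estimates, and simulate_means is a hypothetical helper name.

```r
set.seed(7)
m     <- 2000            # number of samples (ASSUMPTION: larger than the text's 50)
sizes <- c(2, 10, 50)    # illustrative sample sizes

# Hypothetical helper: m sample means, each from n rolls of a fair die.
simulate_means <- function(n, m) {
  replicate(m, mean(sample(1:6, n, replace = TRUE)))
}

sims <- lapply(sizes, simulate_means, m = m)
results <- data.frame(
  n             = sizes,
  mean_of_means = sapply(sims, mean),
  sd_of_means   = sapply(sims, sd)
)
print(results)
```

In runs like this, the mean of the sample means stays near 3.5 while their spread narrows as n grows.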

Standard Error

The behavior in the preceding example is no coincidence; it is actually an illustration of what is known as the Central Limit Theorem. This theorem states that as the sample size $n$ increases, the distribution of the sample means converges to a normal distribution centered at the true population mean, regardless of the distribution of the population from which the sample is taken (provided that the population has finite variance).

Standard Error of the Mean

The Central Limit Theorem also tells us how the spread of the sample means behaves as the sample size $n$ increases: the standard deviation of the sample means, denoted by $\sigma_{\bar{x}}$, is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}},$$

where $\sigma$ is the standard deviation of the population. This standard deviation of the sample means is called the standard error of the mean. Using the standard error $\sigma_{\bar{x}}$ and the population mean $\mu$, one can use the fact that the sample mean is approximately normally distributed for sufficiently large $n$ to compute the probability that the sample mean will fall within a certain interval, as has been shown previously for a general normal distribution.

Example

In the case of the roll of a six-sided die, with a sample size of $n = 20$, the standard error is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{35/12}}{\sqrt{20}} \approx 0.3819.$$

Therefore, to obtain the probability that the sample mean will be greater than 4, we compute the z-score for 4:

$$z = \frac{4 - \mu}{\sigma_{\bar{x}}} = \frac{4 - 3.5}{0.3819} \approx 1.309.$$

We conclude that

$$P(\bar{X} > 4) = 1 - P(\bar{X} \le 4) = 1 - P(Z \le 1.309) \approx 1 - 0.905 = 0.095.$$

That is, there is a less than 10% chance that the sample mean will be greater than 4.
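The same calculation can be reproduced in R, with pnorm standing in for a table of the standard normal distribution. This is only a sketch of the arithmetic in this example; the variable names are illustrative.

```r
mu    <- 3.5             # population mean for a fair die
sigma <- sqrt(35 / 12)   # population standard deviation
n     <- 20              # sample size

se <- sigma / sqrt(n)    # standard error of the mean
z  <- (4 - mu) / se      # z-score for a sample mean of 4
p  <- 1 - pnorm(z)       # P(sample mean > 4) under the normal approximation

print(round(c(se = se, z = z, p = p), 4))
```

Equivalently, pnorm(4, mean = mu, sd = se, lower.tail = FALSE) gives the probability without computing the z-score explicitly.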

Sampling Distribution of the Sum

Suppose that instead of taking the mean of the observations in each sample, we instead take the sum. If the population mean and standard deviation are $\mu$ and $\sigma$, respectively, then as $n$ increases, the sampling distribution of the sum converges to $N(n\mu, \sigma\sqrt{n})$, where the second argument is the standard deviation. That is, the mean $\mu$ and standard deviation $\sigma/\sqrt{n}$ of the sampling distribution of the mean are simply multiplied by $n$.
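A brief simulation, offered as a sketch with illustrative values of n and m, compares the mean and standard deviation of simulated sums of die rolls with the predicted values $n\mu$ and $\sigma\sqrt{n}$.

```r
set.seed(11)
n <- 20       # rolls per sample (illustrative)
m <- 10000    # number of samples (illustrative)

mu    <- 3.5
sigma <- sqrt(35 / 12)

# Sum of n die rolls, repeated m times.
sums <- replicate(m, sum(sample(1:6, n, replace = TRUE)))

print(c(simulated_mean = mean(sums), predicted_mean = n * mu))
print(c(simulated_sd   = sd(sums),   predicted_sd   = sigma * sqrt(n)))
```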

In addition to the mean, we can measure the proportion of the population that possesses a characteristic that is binary in nature, such as whether a person agrees with a particular statement. Because of the binary nature of the characteristic, the number of members of a sample who possess it follows a binomial distribution. That is, the act of inquiring of each member is a Bernoulli trial, in which success and failure correspond to yes or no responses. However, as noted previously, if the number of trials $n$ is sufficiently large that $np \ge 5$ and $n(1 - p) \ge 5$, where $p$ is the probability of success, then this binomial distribution can be approximated by a normal distribution.

Standard Error of the Proportion

We therefore need the mean and standard deviation of this normal distribution. Because the population proportion $p$ is unknown, we must instead use the sample proportion $p_s$, which is defined to be the number of successes in the sample, divided by the sample size $n$. Several samples can be taken, and then their sample proportions can be averaged to obtain an approximate value for $p$. The standard deviation of this distribution, called the standard error of the proportion, is given by

$$\sigma_p = \sqrt{\frac{p(1 - p)}{n}}.$$

Relation to the Binomial Distribution

It is worth noting that $\sigma_p$ is equal to the standard deviation of the binomial distribution, $\sqrt{np(1 - p)}$, divided by $n$. This makes sense because in the sampling distribution of the proportion, we are not measuring the number of successes, as we are in the binomial distribution. Rather, we are measuring the proportion of successes, thus requiring the division of both the binomial distribution's mean and standard deviation by $n$.
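Written out, the division by $n$ described above works as follows, using the standard binomial mean $np$ and standard deviation $\sqrt{np(1-p)}$:

```latex
\frac{np}{n} = p,
\qquad
\sigma_p = \frac{\sqrt{np(1-p)}}{n}
         = \sqrt{\frac{np(1-p)}{n^2}}
         = \sqrt{\frac{p(1-p)}{n}} .
```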

Example

Suppose that through sampling, with samples of size $n = 100$, it is determined that 60% of voters in California support a particular ballot initiative (that is, $p = 0.6$). Because $np = 100(0.6) = 60$ and $n(1 - p) = 100(0.4) = 40$ are large enough, we may use a normal distribution to model the sampling distribution of the proportion.

Example, cont'd

The standard error of the proportion is

$$\sigma_p = \sqrt{\frac{0.6(1 - 0.6)}{100}} \approx 0.0490.$$

Therefore, the probability that more than 65% of the next sample will support the initiative is

$$P(p_s > 0.65) = 1 - P(p_s \le 0.65) = 1 - P(Z \le 1.02) \approx 1 - 0.8461 = 0.1539,$$

where the z-score for 0.65 is

$$z = \frac{0.65 - p}{\sigma_p} = \frac{0.65 - 0.6}{0.0490} \approx 1.02.$$
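The arithmetic in this example can be checked in R along the following lines. This is a sketch with illustrative variable names, and pnorm again replaces the normal table, so the result may differ from the table value in the last digit because the z-score is not rounded to 1.02.

```r
p <- 0.6    # estimated population proportion
n <- 100    # sample size

# Condition for using the normal approximation.
stopifnot(n * p >= 5, n * (1 - p) >= 5)

se_p <- sqrt(p * (1 - p) / n)   # standard error of the proportion
z    <- (0.65 - p) / se_p       # z-score for a sample proportion of 0.65
prob <- 1 - pnorm(z)            # P(sample proportion > 0.65)

print(round(c(se_p = se_p, z = z, prob = prob), 4))
```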
