ST 371 (IX): Theories of Sampling Distributions


1 Sample, Population, Parameter and Statistic

The major use of inferential statistics is to use information from a sample to infer characteristics about a population. A population is the complete collection of subjects to be studied; it contains all subjects of interest. A sample is a part of the population of interest, a sub-collection selected from the population. A parameter describes a characteristic of a population, while a statistic describes a characteristic of a sample. In general, we will use a statistic to infer the value of a parameter.

Unbiased sample: a sample is unbiased if every individual or element in the population has an equal chance of being selected.

Next we discuss several examples that arise in survey sampling.

1. Survey in a presidential election.

   (a) Option I: Call all registered voters on the phone and ask them who they will vote for. Although this would provide a very accurate result, it would be a very tedious and time-consuming project.

   (b) Option II: Call 4 registered voters, 1 in each time zone, and ask them who they will vote for. Although this is a very easy task, the results would not be very reliable.

   (c) Option III: Randomly select 20,000 registered voters and poll them. The population of interest here is all registered voters, and the parameter is the percentage of them that will vote for a candidate. The sample is the 20,000 registered voters that were polled, and the statistic is the percentage of them that will vote for a candidate.

2. Kathy wants to know how many students in her city use the internet for learning purposes. She used an email poll.

Based on the replies to her poll, she found that 83% of those surveyed used the internet. Kathy's sample is biased because she surveyed only students who already use the internet. She should have randomly selected a few schools and colleges in the city to conduct the survey.

3. Another classic example of a biased sample, and of the misleading results it can produce, occurred in 1936. In the early days of opinion polling, the American Literary Digest magazine collected over two million postal surveys and predicted that the Republican candidate in the U.S. presidential election, Alf Landon, would beat the incumbent president, Franklin Roosevelt, by a large margin. The result was the exact opposite. The Literary Digest sample was drawn from readers of the magazine, supplemented by records of registered automobile owners and telephone users. This sample overrepresented the wealthy, who, as a group, were more likely to vote for the Republican candidate. In contrast, a poll of only 50 thousand citizens selected by George Gallup's organization successfully predicted the result, leading to the popularity of the Gallup poll.

Conclusion: To use a sample to make inferences about a population, the sample should be representative of the population (unbiased).

2 Statistics and their Distributions

A statistic is a random variable, denoted by an upper-case letter, whose value can be computed from sample data. We often use a statistic to infer the value of a parameter. Examples include:

Measures of location: Suppose we observe n realizations x_1, ..., x_n of a random variable X. The sample mean is

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.

In contrast, the population mean is E(X) = \mu.

The sample median: let x_{(1)}, ..., x_{(n)} denote the ordered values. If n is odd, then

    \tilde{x} = x_{((n+1)/2)}.

If n is even,

    \tilde{x} = \frac{1}{2} \left[ x_{(n/2)} + x_{(n/2+1)} \right].

In contrast, the population median is \tilde{\mu} = F_X^{-1}(0.5).

Measure of variability: the sample variance is

    S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.

Note that the population variance is

    \sigma^2 = V(X) = E(X - \mu)^2.

Measure of contrasts: Consider random samples {x_1, ..., x_n} and {y_1, ..., y_m} from two populations; for example, in a randomized clinical trial we may compare the quality of life (QOL) of the patients (or their survival time, or cure rate) on two treatment arms via

    T = \bar{x} - \bar{y}.

The corresponding contrast between the two populations is \mu_X - \mu_Y = E(X) - E(Y).

Each statistic is a random variable and has a probability distribution. The probability distribution of a statistic is referred to as its sampling distribution. The sampling distribution depends not only on the population distribution but also on the method of sampling. The most widely used sampling method is random sampling with replacement.

The random variables X_1, ..., X_n are said to form a random sample of size n, or to be independently and identically distributed (i.i.d.), if

1. The X_i's are independent random variables.

2. Every X_i has the same probability distribution.

Denote by \mu and \sigma^2 the mean and variance of the random variable X. The next theorem follows from the results on the distribution of a linear combination that we shall discuss in Section 4.
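These sample statistics are exactly what standard numerical libraries compute. A minimal Python sketch on arbitrary illustrative data; the one point worth flagging is the 1/(n-1) divisor for S^2:

    # Sample statistics for a small illustrative data set.
    import numpy as np

    x = np.array([2.3, 1.9, 3.1, 2.8, 2.0])   # arbitrary illustrative data
    print(x.mean())                            # sample mean x-bar
    print(np.median(x))                        # sample median (averages the middle two values if n is even)
    print(x.var(ddof=1))                       # sample variance S^2: ddof=1 gives the 1/(n-1) divisor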

Theorem (distribution of the sample mean \bar{X}).

1. E(\bar{X}) = \mu_{\bar{X}} = \mu.

2. V(\bar{X}) = \sigma^2_{\bar{X}} = \sigma^2 / n.

3. \sigma_{\bar{X}} = \sigma / \sqrt{n}.

Example 1. Let X_1, ..., X_5 be a random sample from a normal distribution with \mu = 1.5 and \sigma = 0.35. Find P(\bar{X} \le 2.0). Find the variance of \sum_{i=1}^{5} X_i.
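By the theorem, \bar{X} ~ N(1.5, 0.35^2/5) and Var(\sum X_i) = 5\sigma^2. A minimal numerical sketch, taking the event to be {\bar{X} \le 2.0} (the opposite tail is one minus this):

    # Example 1: X-bar ~ N(mu, sigma^2/n) by the theorem above.
    from math import sqrt
    from scipy.stats import norm

    mu, sigma, n = 1.5, 0.35, 5
    se = sigma / sqrt(n)                      # standard deviation of the sample mean

    p_le = norm.cdf(2.0, loc=mu, scale=se)    # P(X-bar <= 2.0), about 0.9993
    var_sum = n * sigma**2                    # Var(sum X_i) = 5 * 0.35^2 = 0.6125
    print(p_le, var_sum)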

Example 2. Service time for a certain bank transaction is a random variable having an exponential distribution with parameter \lambda. Suppose X_1 and X_2 are the service times for two independent customers. Consider the average service time \bar{X} = (X_1 + X_2)/2. Find the cdf of \bar{X}. Find the pdf of \bar{X}. Find the mean and variance of \bar{X}.
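One way to check the answers: the sum of two independent Exp(\lambda) times is Gamma(2, \lambda), so \bar{X} has cdf F(x) = 1 - e^{-2\lambda x}(1 + 2\lambda x) and pdf f(x) = 4\lambda^2 x e^{-2\lambda x} for x \ge 0, with mean 1/\lambda and variance 1/(2\lambda^2). A simulation sketch; the rate \lambda = 0.5 is an arbitrary choice for illustration:

    # Simulation check of Example 2 at an arbitrary illustrative rate lambda = 0.5.
    import numpy as np

    lam = 0.5
    rng = np.random.default_rng(0)
    x1 = rng.exponential(scale=1/lam, size=100_000)   # numpy parameterizes by scale = 1/lambda
    x2 = rng.exponential(scale=1/lam, size=100_000)
    xbar = (x1 + x2) / 2

    t = 3.0
    emp_cdf = np.mean(xbar <= t)                      # empirical P(X-bar <= t)
    thy_cdf = 1 - np.exp(-2*lam*t) * (1 + 2*lam*t)    # derived cdf
    print(emp_cdf, thy_cdf)                           # should agree to ~2 decimals
    print(xbar.mean(), 1/lam)                         # mean 1/lambda
    print(xbar.var(), 1/(2*lam**2))                   # variance 1/(2*lambda^2)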

3 Limit Theorems

3.1 Weak law of large numbers

Consider a sample of independent and identically distributed random variables X_1, ..., X_n. The relationship between the sample mean

    \bar{X}_n = \frac{X_1 + \cdots + X_n}{n}

and the true mean of the X_i's, E(X_i) = \mu, is a problem of pivotal importance in statistics. Typically, \mu is unknown and we would like to estimate \mu based on \bar{X}_n. The weak law of large numbers says that the sample mean converges in probability to \mu. This means that for a large enough sample size n, \bar{X}_n will be close to \mu with high probability.

The weak law of large numbers. Let X_1, X_2, ... be a sequence of independent and identically distributed random variables, each having finite mean E(X_i) = \mu. Then, for any \epsilon > 0,

    P\{ |\bar{X}_n - \mu| \ge \epsilon \} \to 0  as  n \to \infty.    (3.1)

Example 3. A numerical study of the law of large numbers. We first simulate normal random variables from N(5, 1) with different sample sizes, then calculate the difference between the sample mean and the population mean:

    n                        5        20        500      10000     50000
    Bias: \bar{X}_n - \mu    0.8323   -0.1339   0.0368   0.0069    -0.0092

We can see that \bar{X}_n based on a large n tends to be closer to \mu than does \bar{X}_n based on a small n.

Example 4 (optional). Application of the weak law of large numbers: Monte Carlo integration. Suppose that we wish to calculate

    I(f) = \int_0^1 f(x) \, dx,

where the integration cannot be done by elementary means or evaluated using tables of integrals. The most common approach is to use a numerical method in which the integral is approximated by a sum; various schemes and computer packages exist for doing this. Another method, called the Monte Carlo method, works in the following way. Generate independent uniform random variables X_1, ..., X_n on (0, 1) and compute

    \hat{I}(f) = \frac{1}{n} \sum_{i=1}^{n} f(X_i).

By the law of large numbers, for large n this should be close to E[f(X)], which is simply

    E[f(X)] = \int_0^1 f(x) \, dx = I(f).

This simple scheme can easily be modified to change the range of integration, and in other ways. Compared to standard numerical methods, Monte Carlo integration is not especially efficient in one dimension, but it becomes increasingly efficient as the dimensionality of the integral grows.
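A minimal sketch of the scheme in Example 4; the integrand f(x) = e^{-x^2} is an arbitrary illustrative choice (any integrable f on (0, 1) would do):

    # Monte Carlo integration of I(f) = integral of f over (0,1), as in Example 4.
    import numpy as np

    def f(x):
        return np.exp(-x**2)        # illustrative integrand; I(f) ~ 0.7468

    rng = np.random.default_rng(1)
    for n in (10, 1_000, 100_000):
        x = rng.uniform(0.0, 1.0, size=n)   # X_1, ..., X_n iid Uniform(0,1)
        print(n, f(x).mean())               # I-hat(f) approaches I(f) as n grows

The estimate at n = 100,000 typically agrees with the true value to two or three decimals, illustrating the law-of-large-numbers convergence.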

3.2 Strong law of large numbers (optional)

The strong law of large numbers states that for a sequence of independent and identically distributed random variables X_1, X_2, ..., the sample mean converges almost surely to the mean of the random variables, E(X_i) = \mu.

The strong law of large numbers. Let X_1, X_2, ... be a sequence of independent and identically distributed random variables, each having a finite mean \mu = E(X_i). Then, with probability 1,

    \bar{X}_n \to \mu  as  n \to \infty.

The weak law of large numbers states that for any specified large value n*, \bar{X}_{n*} is likely to be near \mu. However, it does not say that \bar{X}_n is bound to stay near \mu for all values of n larger than n*. Thus, it leaves open the possibility that large values of |\bar{X}_n - \mu| can occur infinitely often (though at infrequent intervals). The strong law shows that this cannot occur. In particular, it implies that, with probability 1, for any positive value \epsilon, |\bar{X}_n - \mu| will be greater than \epsilon only a finite number of times.

The strong law of large numbers is of enormous importance, because it provides a direct link between the axioms of probability and the frequency interpretation of probability. If we accept the interpretation that "with probability 1" means "with certainty," then we can say that P(E) is the limit of the long-run relative frequency of times E would occur in repeated, independent trials of the experiment.

3.3 Central limit theorem

The weak law of large numbers says that for X_1, ..., X_n iid, the sample mean \bar{X}_n is close to E(X_i) = \mu when n is large. The central limit theorem provides a more precise approximation by showing that a magnification of the distribution of \bar{X}_n around \mu is approximately standard normal.

The Central Limit Theorem (CLT). Let X_1, ..., X_n be a sequence of independent and identically distributed random variables, each having finite mean E(X_i) = \mu and finite variance Var(X_i) = \sigma^2. Then the distribution of (\bar{X}_n - \mu) / (\sigma / \sqrt{n}) tends to the standard normal distribution as n \to \infty. That is, for any -\infty < a < \infty,

    P\left( \frac{X_1 + \cdots + X_n - n\mu}{\sigma \sqrt{n}} \le a \right) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx

as n \to \infty.

The theorem can be thought of as roughly saying that the sum of a large number of iid random variables has a distribution that is approximately normal. By writing

    \frac{X_1 + \cdots + X_n - n\mu}{\sigma \sqrt{n}} = \frac{n (\bar{X}_n - \mu)}{\sigma \sqrt{n}} = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}},

we see that the CLT says that the sample mean \bar{X}_n has approximately a normal distribution with mean \mu and standard deviation \sigma / \sqrt{n} (variance \sigma^2 / n). The CLT is a remarkable result: assuming only that a sequence of iid random variables has a finite mean and variance, it shows that the sample mean, suitably standardized, always converges to a standard normal distribution.

The normal approximation to the binomial distribution is a special case of the central limit theorem.

Consider a skewed distribution (lognormal), and consider the histogram of the sample mean \bar{X}_n for n = 1, 5, 10, 30.

[Figure: four histograms of the sample mean of lognormal data, in panels n = 1, n = 5, n = 10, and n = 30 (frequency versus x1.bar, x2.bar, x3.bar, x4.bar).]

We can see from the histograms that the sampling distribution becomes progressively less skewed as the sample size n increases, and can therefore be better approximated by a normal distribution. This result shows that the central limit theorem can be successfully applied when n is large. In general, the rule of thumb is n > 30.
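A sketch that reproduces this experiment; the lognormal parameters (mean 0, sigma 1 on the log scale) and the number of replications are assumed choices, since the notes do not state them:

    # CLT demonstration: histograms of the sample mean of skewed (lognormal) data.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    fig, axes = plt.subplots(2, 2)
    for ax, n in zip(axes.flat, (1, 5, 10, 30)):
        # 5000 replications of the mean of n lognormal observations
        means = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, n)).mean(axis=1)
        ax.hist(means, bins=40)            # skewness fades as n grows
        ax.set_title(f"n={n}")
    plt.tight_layout()
    plt.show()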

Example 5. An airline overbooks a flight because it expects that there will be no-shows. Assume that:

(i) There are 200 seats available on the flight.

(ii) Seats are occupied only by individuals who made reservations (no standbys).

(iii) The probability that a person who made a reservation shows up for the flight is 0.95.

(iv) Reservations show up for the flight independently of each other.

1. If the airline accepts 220 reservations, write an expression for the exact probability that the plane will be full (i.e., at least 200 reservations show up). Use the central limit theorem to approximate this probability.

2. Suppose the airline wants to choose the number n of reservations so that the probability that at least 200 of the n reservations show up is 0.75. Find the (approximate) minimum value of n.
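The number of show-ups is Bin(220, 0.95), with mean 209 and standard deviation sqrt(220 * 0.95 * 0.05) ~ 3.23. A sketch comparing the exact tail with the continuity-corrected CLT approximation, and searching for the minimum n in part 2 (scipy supplies the binomial and normal tails):

    # Example 5: exact binomial probability vs. CLT approximation.
    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 220, 0.95
    exact = binom.sf(199, n, p)               # P(X >= 200) for X ~ Bin(220, 0.95)
    mu, sd = n*p, sqrt(n*p*(1-p))
    approx = norm.sf((199.5 - mu) / sd)       # CLT with continuity correction
    print(exact, approx)                      # both roughly 0.998

    # Part 2: smallest n with P(at least 200 of n show up) >= 0.75.
    m = 200
    while binom.sf(199, m, p) < 0.75:
        m += 1
    print(m)                                  # minimum number of reservations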

Example 6. The number of parking tickets issued in Raleigh on any given weekday has a Poisson distribution with parameter \lambda = 50. What is the approximate probability that

(a) between 35 and 70 tickets are given out on a particular day?

(b) the total number of tickets given out during a 5-day week is between 225 and 275?
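A Poisson(\lambda) count has mean and variance \lambda, and a sum of independent Poissons is Poisson, so the weekly total is Poisson(250). A sketch of the normal approximation, taking "between" as inclusive and using a continuity correction:

    # Example 6: normal approximation to Poisson tail probabilities.
    from math import sqrt
    from scipy.stats import norm, poisson

    # (a) X ~ Poisson(50): P(35 <= X <= 70) with continuity correction.
    a = norm.cdf((70.5 - 50)/sqrt(50)) - norm.cdf((34.5 - 50)/sqrt(50))
    # (b) Weekly total T ~ Poisson(250): P(225 <= T <= 275).
    b = norm.cdf((275.5 - 250)/sqrt(250)) - norm.cdf((224.5 - 250)/sqrt(250))
    print(a, b)   # roughly 0.98 and 0.89

    # Exact values for comparison:
    print(poisson.cdf(70, 50) - poisson.cdf(34, 50))
    print(poisson.cdf(275, 250) - poisson.cdf(224, 250))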

4 Distribution of a Linear Combination

Given a collection of n random variables X_1, ..., X_n and n numerical constants a_1, ..., a_n, the random variable

    Y = a_1 X_1 + \cdots + a_n X_n = \sum_{i=1}^{n} a_i X_i

is called a linear combination of the X_i's.

Let X_1, ..., X_n have means \mu_1, ..., \mu_n and variances \sigma_1^2, ..., \sigma_n^2, respectively. Then:

1. E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 E(X_1) + a_2 E(X_2) + \cdots + a_n E(X_n) = a_1 \mu_1 + \cdots + a_n \mu_n.

2. If X_1, X_2, ..., X_n are independent, then

    Var(a_1 X_1 + \cdots + a_n X_n) = a_1^2 Var(X_1) + \cdots + a_n^2 Var(X_n) = a_1^2 \sigma_1^2 + \cdots + a_n^2 \sigma_n^2.

3. For any (possibly dependent) random variables X_1, ..., X_n,

    Var(a_1 X_1 + \cdots + a_n X_n) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j Cov(X_i, X_j).

The case of normal random variables: if X_1, ..., X_n are independent, normally distributed random variables, then any linear combination of the X_i's is also normally distributed.

Special cases:

1. E(\bar{X}) = \mu_{\bar{X}} = \mu.

2. If all X_i are independent, V(\bar{X}) = \sigma^2_{\bar{X}} = \sigma^2 / n.

3. E(X_1 - X_2) = E(X_1) - E(X_2).

4. If X_1 and X_2 are independent, then V(X_1 - X_2) = V(X_1) + V(X_2). Otherwise, V(X_1 - X_2) = V(X_1) + V(X_2) - 2 Cov(X_1, X_2).
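A quick simulation check of properties 1 and 3 for two correlated normals; the coefficients, means, and covariance matrix below are arbitrary illustrative values:

    # Checking E(aX1 + bX2) and Var(aX1 + bX2) = a^2 V(X1) + b^2 V(X2) + 2ab Cov(X1, X2).
    import numpy as np

    a, b = 2.0, -3.0
    mean = [1.0, 4.0]
    cov = [[2.0, 0.5],
           [0.5, 1.0]]                     # Cov(X1, X2) = 0.5

    rng = np.random.default_rng(3)
    x = rng.multivariate_normal(mean, cov, size=200_000)
    y = a*x[:, 0] + b*x[:, 1]

    print(y.mean(), a*mean[0] + b*mean[1])             # property 1: both ~ -10
    print(y.var(), a*a*2.0 + b*b*1.0 + 2*a*b*0.5)      # property 3: both ~ 11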

Example 7. The total revenue from the sale of the three grades of gasoline on a particular day was Y = 21.2 X_1 + 21.35 X_2 + 21.5 X_3. Assume that X_1, X_2 and X_3 are independent with \mu_1 = 1000, \mu_2 = 500, \mu_3 = 300, \sigma_1 = 100, \sigma_2 = 80 and \sigma_3 = 50. What is the probability that the revenue exceeds 45,000?
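Treating the X_i as normal (an assumption the probability question implicitly requires; the example states only means and standard deviations), the linear-combination results give E(Y) = 38,325 and SD(Y) ~ 2,927, so P(Y > 45,000) is a normal upper tail:

    # Example 7: P(Y > 45000) for Y = 21.2 X1 + 21.35 X2 + 21.5 X3,
    # assuming the X_i are independent normals.
    from math import sqrt
    from scipy.stats import norm

    a = [21.2, 21.35, 21.5]
    mu = [1000, 500, 300]
    sigma = [100, 80, 50]

    mean_y = sum(ai*mi for ai, mi in zip(a, mu))                # 38325
    sd_y = sqrt(sum((ai*si)**2 for ai, si in zip(a, sigma)))    # ~2927
    print(norm.sf(45000, loc=mean_y, scale=sd_y))               # ~0.011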

Example 8. A student has a class that is supposed to end at 9:00 am and another class that is supposed to begin at 9:10 am. Suppose that the actual ending time of the first class (in minutes after 9:00) is X_1 ~ N(2, 1.5^2), and the starting time of the next class (in minutes after 9:00) is X_2 ~ N(10, 1^2). Suppose also that the time to get from one location to the other is X_3 ~ N(6, 1^2). What is the probability that the student makes it to the second class before the lecture starts?
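The student makes it if X_1 + X_3 < X_2, i.e., if W = X_2 - X_1 - X_3 > 0. Assuming the three times are independent (as such examples typically intend), W is normal with mean 10 - 2 - 6 = 2 and variance 1 + 2.25 + 1 = 4.25. A sketch of the calculation:

    # Example 8: P(W > 0) for W = X2 - X1 - X3, a normal linear combination
    # (independence of the three times is assumed).
    from math import sqrt
    from scipy.stats import norm

    mean_w = 10 - 2 - 6                       # E(W) = 2
    sd_w = sqrt(1**2 + 1.5**2 + 1**2)         # Var(W) = 4.25
    print(norm.sf(0, loc=mean_w, scale=sd_w)) # ~0.83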

Example 9. Three different roads feed into a particular freeway entrance. Suppose that during a fixed time period the number of cars coming from each road onto the freeway is normally distributed, with X_1 ~ N(750, 16^2), X_2 ~ N(1000, 24^2) and X_3 ~ N(550, 18^2).

(a) What is the expected total number of cars entering the freeway at this point during the period?

(b) Suppose X_1, X_2 and X_3 are independent. Find the probability P(X_1 + X_2 + X_3 > 2500).

(c) Now suppose that the three streams of traffic are not independent, with Cov(X_1, X_2) = 80, Cov(X_1, X_3) = 90 and Cov(X_2, X_3) = 100. Compute the expected value and variance of the total number of entering cars.
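A sketch of all three parts. The expectation is the same with or without independence; only the variance changes, picking up twice the sum of the pairwise covariances in part (c):

    # Example 9: totals of (possibly dependent) normal traffic counts.
    from math import sqrt
    from scipy.stats import norm

    mean_t = 750 + 1000 + 550                 # (a) E(total) = 2300
    var_indep = 16**2 + 24**2 + 18**2         # (b) independent case: 1156, sd = 34
    print(mean_t)
    print(norm.sf(2500, loc=mean_t, scale=sqrt(var_indep)))   # ~2e-9, essentially 0

    # (c) Dependent case: add 2 * (sum of pairwise covariances) to the variance.
    var_dep = var_indep + 2*(80 + 90 + 100)   # 1696; the mean is unchanged
    print(mean_t, var_dep)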